
Case Study: From AI Proof-of-Concept to Production in Six Weeks

A real engagement (anonymized) where a 4-person startup had a working AI prototype but no path to production. Here's the six-week plan that got them shipped — and the trade-offs we made.

Craig Hoffmeyer · 8 min read

The most common conversation I have with seed-stage founders building AI features is some version of: "We have a working prototype. The demo is great. We have no idea how to turn it into a production product." The gap between a working notebook and a paying customer is bigger than founders expect, and most teams stall in it for months before either shipping a fragile, half-baked version or abandoning the feature entirely.

This case study is one engagement where it went well. A 4-person startup with a strong AI prototype and a six-week deadline. The deadline was real — they had committed it to their first three pilot customers in writing. I came in at week zero. They shipped on day 41. This is what happened.

The starting point

The company built tooling for a specific operational role inside mid-market companies. The AI feature in question was a workflow automation that took an unstructured input (a long form, basically) and produced a structured output (a multi-field action plan). The prototype lived in a single Python notebook, used a frontier LLM, and worked beautifully on the 12 examples the founder had hand-curated.

When I joined, the situation was:

  • The prototype: working in a notebook on the founder's laptop. Took 30–90 seconds per run. No web interface. No persistence. No auth.
  • The team: 2 full-stack engineers, 1 ML-leaning engineer who had built the prototype, 1 designer-founder.
  • The deadline: 6 weeks until the first pilot customer demo. 8 weeks until the first pilot customer was using it in production.
  • The budget: roughly $40,000 for any external help. This was why they brought in a fractional CTO rather than hiring a full team.
  • The unknowns: quality on real customer data, latency at scale, cost per request, and what to do when the model failed (which it did, on roughly 20% of inputs in the prototype).

The plan

We had one planning session at the end of day 1. The plan I wrote was deliberately small and concrete. Six weeks, six themes, one per week. We did not call them sprints. We did not set up a Jira board. We used a single shared document with a status line.

Week 1: Eval set and quality baseline. Before we wrote a single line of production code, we needed to know what "good" meant. Building an eval set was the first job.

Week 2: Production-shaped backend. A real API, real persistence, real auth, deployed to a real environment.

Week 3: Frontend integration and end-to-end flow. A web UI that actually let a user do the thing.

Week 4: Quality work — the hard part. Improving the success rate on the eval set from baseline to acceptable.

Week 5: Hardening, observability, and cost controls. The unsexy week.

Week 6: Pilot customer onboarding. Real users on real data.

I want to be specific about why the order matters. Most teams in this situation start with the frontend. They build a beautiful interface, integrate the prototype as-is, and then realize the quality is terrible and the cost is impossible. We started with the eval and the backend because they were the things that would actually determine whether the product was viable.

Week 1: Eval set

The ML engineer and I spent three days building an eval set from 50 anonymized real customer documents. The founder pulled them from existing customer relationships under appropriate terms. We hand-graded the prototype's output on each one across four dimensions: completeness, correctness, formatting, and "would a customer trust this."

The baseline was ugly. The prototype that had looked great on 12 cherry-picked examples was scoring around 62% overall on the broader set. The "would a customer trust this" dimension specifically was at 48%. We could not ship that.

But now we had a number. Every change we made for the next five weeks had a clear test: did the eval score go up or down? This was the most important artifact of the whole engagement.
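The grading setup can be sketched as a tiny harness. Everything here is illustrative rather than the code we actually used: the dimension names come from the post, but the pass/fail grading scheme and all identifiers are assumptions.

```python
from dataclasses import dataclass

# Hypothetical eval harness: each document is hand-graded 0/1 on four
# dimensions, and the overall score is the mean pass rate across them.
DIMENSIONS = ("completeness", "correctness", "formatting", "trust")

@dataclass
class GradedExample:
    doc_id: str
    grades: dict  # dimension -> 0 or 1, assigned by a human grader

def dimension_score(examples, dimension):
    """Fraction of examples that passed a single dimension."""
    return sum(ex.grades[dimension] for ex in examples) / len(examples)

def overall_score(examples):
    """Mean pass rate across all four dimensions."""
    return sum(dimension_score(examples, d) for d in DIMENSIONS) / len(DIMENSIONS)

# Example: two hand-graded documents
evals = [
    GradedExample("doc-001", {"completeness": 1, "correctness": 1, "formatting": 1, "trust": 0}),
    GradedExample("doc-002", {"completeness": 1, "correctness": 0, "formatting": 1, "trust": 1}),
]
print(f"overall: {overall_score(evals):.0%}")             # overall: 75%
print(f"trust:   {dimension_score(evals, 'trust'):.0%}")  # trust:   50%
```

The value of even a crude harness like this is that every prompt or pipeline change maps to a single number moving up or down.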

Week 2: Production-shaped backend

While the ML engineer kept working on prompt iterations against the eval, the two full-stack engineers built the production backend. Specifically:

  • A real API with the existing prototype as a thin endpoint.
  • Postgres for persistence (per the boring stack).
  • Auth via Clerk.
  • Hosting on Fly.
  • Background job queue for the long-running LLM calls.
  • Feature flags so we could roll forward and back without redeploying.

This took 4 days of focused work and was deliberately boring. No new technologies, no clever architecture. The goal was to stop running the product on the founder's laptop, not to build anything novel.
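The background-job piece deserves a sketch, because it is the pattern most teams skip when they wrap a notebook in an API. This is a minimal stdlib-only illustration of the shape, not the real implementation (which used a proper job queue behind the API); the function and variable names are assumptions.

```python
import queue
import threading
import uuid

jobs = {}            # job_id -> {"status": ..., "result": ...}
work = queue.Queue()

def fake_llm_call(document: str) -> str:
    """Stand-in for the 30-90 second model call."""
    return f"action plan for: {document}"

def worker():
    # Pull jobs off the queue and run the slow call outside the request path.
    while True:
        job_id, document = work.get()
        jobs[job_id]["status"] = "running"
        jobs[job_id]["result"] = fake_llm_call(document)
        jobs[job_id]["status"] = "done"
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

def submit(document: str) -> str:
    """What the API endpoint does: enqueue and return a job id immediately."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    work.put((job_id, document))
    return job_id

job = submit("quarterly ops review form")
work.join()  # in production the client polls for status; here we just wait
print(jobs[job]["status"], "->", jobs[job]["result"])
```

The point is only the shape: the HTTP request returns instantly with a job id, and the 30-90 second LLM call happens off the request path.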

Week 3: Frontend and end-to-end

The third week was the first time the product looked like a product. The two engineers built a simple frontend that walked a user through uploading the input, watching the processing, reviewing the output, and exporting the result. It was ugly. We did not care. It was real.

We did our first internal end-to-end test on day 19. It was painful. About a third of the test runs hit some bug — usually a backend timeout, sometimes a frontend rendering issue, occasionally a real LLM failure. We fixed the obvious ones immediately and added the LLM failures to the eval set as new test cases.

Week 4: Quality work

This was the hardest week. The ML engineer and I spent the entire week iterating on the prompt, the context assembly, and the model selection. We made about 30 small changes, ran the eval after every batch of changes, and kept the ones that improved the score.

The biggest single improvement came from changing how we structured the context. We had been giving the model the entire input document as a single blob. We changed it to a structured representation with clear section headers. The eval score jumped from 71% to 84% on that change alone.

The second biggest improvement was switching to a two-pass model: a cheaper model did a first-pass extraction, then the frontier model reviewed and corrected the first-pass output. This improved both the quality (89% by the end of the week) and the cost (about a 40% reduction per request).
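The two-pass shape looks like this, with both model calls stubbed out. Treat it as a sketch of the control flow only; the stub outputs, field names, and function names are all assumptions.

```python
def cheap_extract(document: str) -> dict:
    """First pass: a fast, inexpensive model produces a rough draft.
    It is allowed to miss or fumble fields."""
    return {"steps": ["draft step"], "owner": None}  # stubbed model call

def frontier_review(document: str, draft: dict) -> dict:
    """Second pass: the stronger model sees both the document and the
    draft, and only has to review and correct, not generate from scratch."""
    reviewed = dict(draft)
    if reviewed.get("owner") is None:
        reviewed["owner"] = "ops lead"  # stubbed correction
    return reviewed

def run_pipeline(document: str) -> dict:
    draft = cheap_extract(document)
    return frontier_review(document, draft)

plan = run_pipeline("long unstructured intake form...")
print(plan)
```

The economics follow from the shape: the expensive model spends tokens correcting a draft rather than generating everything, which is how the same change improved both quality and cost.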

By the end of week 4, the "would a customer trust this" dimension was at 87%. Not perfect, but in the range where we were comfortable putting it in front of a friendly pilot customer.

Week 5: Hardening

Week 5 was the unsexy work that determines whether the product survives contact with real users. We did all of it:

  • Sentry error tracking on the frontend and backend.
  • Per-feature cost dashboard, tagged by customer.
  • Hard caps on model iterations and token counts per request.
  • A retry strategy with sane backoff.
  • Soft and hard rate limits per customer.
  • A "human in the loop" review queue for outputs the model was unsure about (we tagged outputs with a confidence score and routed low-confidence ones to a queue the founder could review).
  • A simple admin panel for the founder to inspect any customer's runs and override if needed.
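The "sane backoff" item from the list above is worth one small sketch. The delay values, attempt count, and exception type are illustrative; the real strategy also distinguished retryable errors from permanent ones.

```python
import time

def call_with_retry(fn, attempts=3, base_delay=1.0):
    """Retry fn with exponential backoff (1s, 2s, 4s, ...) and a hard
    cap on attempts, so a flaky LLM call never retries forever."""
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt))

# A call that fails twice, then succeeds
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient model error")
    return "ok"

result = call_with_retry(flaky, base_delay=0.01)
print(result)  # ok, after two retries
```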

This week is the week most teams skip and then regret. I insisted on it because I have seen what happens to teams who ship without these things. The first time a customer reports an issue you cannot diagnose, you wish you had spent the week.
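The human-in-the-loop routing is simple enough to sketch in a few lines. The threshold value is an assumption (in practice it would be tuned against the eval set), and the data shapes are illustrative.

```python
REVIEW_THRESHOLD = 0.75  # assumed cutoff; tuned against the eval set in practice

review_queue: list[dict] = []
delivered: list[dict] = []

def route(output: dict) -> str:
    """Send low-confidence outputs to the founder's review queue
    instead of straight to the customer."""
    if output["confidence"] < REVIEW_THRESHOLD:
        review_queue.append(output)
        return "review"
    delivered.append(output)
    return "delivered"

results = [
    {"doc": "a", "confidence": 0.92},
    {"doc": "b", "confidence": 0.61},  # flagged for human review
    {"doc": "c", "confidence": 0.88},
]
for r in results:
    route(r)

print(len(delivered), "delivered,", len(review_queue), "in review queue")
# 2 delivered, 1 in review queue
```

The whole mechanism is one threshold and one queue, which is why it fits in a hardening week; the hard part is producing a confidence score you trust.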

Week 6: Pilot customers

Day 36, we onboarded the first pilot customer. The founder walked them through it personally over a 90-minute call. The customer ran their first real document through the system live. It worked. The output was good enough that the customer asked if they could use it on three more documents that day.

By day 41, all three pilot customers were onboarded. The system had processed about 200 real documents. The eval score was holding around 88%. The cost per request was about $0.80, well within the founder's target. The "human in the loop" queue was getting maybe 8% of runs flagged for review, which the founder could handle in 20 minutes per day.

We shipped on time. Not perfectly, but on time and with the right things in place to learn from real users.

The trade-offs we made

Six weeks is not enough time to build everything. The list of things we deliberately did not do is as important as what we did:

  • No fancy UI. The frontend was functional, not beautiful. Design could come later.
  • No mobile. Web only.
  • No multi-tenancy beyond the basics. Each pilot customer had their own data, but the architecture was not yet ready for self-serve.
  • No SOC 2. We knew it would be needed eventually but it was not blocking the pilot.
  • No analytics dashboard for customers. They got results; they did not get usage data.
  • No batch upload. One document at a time was fine for the pilot.
  • Minimal admin tooling for the team. Just enough to inspect and override; no fancy reporting.

Every one of those trade-offs was discussed and recorded in the plan document. Nothing was forgotten — it was deliberately deferred. That distinction matters because the team knew exactly what was waiting for them after the pilot.

The lessons

Three things from this engagement that I now use as defaults on every similar one.

Lesson 1: Build the eval first, always. A team without an eval is a team flying blind on quality. Building the eval before the production code felt slow at the time and turned out to be the single highest-leverage decision of the engagement. Every quality change had a clear test.

Lesson 2: Boring stack, fancy feature. The only place we used anything novel was the AI layer itself. Everything else — auth, hosting, database, UI framework — was the most boring possible choice. This freed the team to spend their attention on the part that mattered.

Lesson 3: Hardening week is non-negotiable. Week 5 was the week the team most wanted to skip and the week that paid for itself within 48 hours of pilot launch. Observability, cost controls, and a human-in-the-loop fallback are the difference between a pilot that succeeds and one that dissolves into chaos and consumes the founder's time.

Counterpoint: this only works with the right team

The team made this possible. Three engineers willing to ship boring code on a deadline, an ML-leaning engineer who could iterate on prompts, and a founder who could make trade-off decisions in real time. If any of those had been missing, six weeks would not have been enough.

If you have a smaller team or a weaker fit in any of those roles, the path is not "compress the same plan into less time." It is "extend the timeline" or "narrow the scope." Pretending you can ship a six-week plan in three weeks with two engineers is a great way to ship something fragile and lose the pilot customer.

Your next step

If you are stuck between a working AI prototype and a production product, try this exercise. Write a one-page plan organized as six weekly themes, in the order I used: eval, backend, end-to-end, quality, hardening, pilot. For each week, list the specific deliverable. If any week feels impossible with your current team, that is the conversation to have before you commit a deadline to a customer.

Where I come in

This kind of focused, deadline-driven engagement is one of the most fun things I do as a fractional CTO. Six weeks, a clear deliverable, a small team I work alongside, and a real production launch at the end. Book a call if you are sitting on an AI prototype that you cannot quite get into production.


Related reading: Build vs. Buy vs. Wrap: A Founder's Framework for AI Features · The Seed-Stage Stack · The Hidden Cost Curve of LLM Features
