AI / LLM Strategy

Why Your Eval Suite Matters More Than Your Prompt

The single piece of infrastructure that separates AI features that improve from AI features that drift — what a real eval suite looks like, how to build one, and what to measure.

Craig Hoffmeyer · 7 min read

Here is a pattern I see so often it has become predictable. A startup ships an AI feature. The prompt is good. The feature works well enough that the founder is excited. Over the next three months, the team makes dozens of small changes — a new model version, a prompt tweak, a different chunking strategy, a system prompt tune. Each change is defended by someone on the team with "I tested it and it looks better." No one is sure whether the feature is getting better overall or worse. Three months in, the product either feels vaguely worse than it used to (nobody can say exactly why) or the team refuses to make any changes at all because they are afraid of regressions.

The missing thing is an eval suite. Not a fancy one. Just a real one. The team that has an eval — even a crude eval with 30 hand-graded examples — will beat the team with better prompts and no eval every time. This article is about why that is true and how to build the eval that does the work.

What an eval actually is

An eval suite is a set of input-output pairs plus a scoring function. You run your AI feature against the inputs, compare the outputs to the expected outputs (or grade them on quality dimensions), and get a number. That number goes up or down as you make changes. When the number goes down, you revert. When it goes up, you ship.

That is the entire concept. Everything else is implementation.
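
The loop above can be sketched in a few lines of TypeScript. Everything here is illustrative: `runFeature` is a stand-in for your actual AI feature (normally an LLM call), and the scorer is the simplest possible exact match:

```typescript
// A minimal eval: examples, a scoring function, and one aggregate number.
type Example = { input: string; expected: string };

// Hypothetical stand-in for your AI feature (normally an LLM call).
function runFeature(input: string): string {
  return input.trim().toLowerCase();
}

// Exact-match scorer: 1 if the output matches the expectation, 0 otherwise.
function scoreOne(output: string, expected: string): number {
  return output === expected ? 1 : 0;
}

function runEval(examples: Example[]): number {
  const total = examples.reduce(
    (sum, ex) => sum + scoreOne(runFeature(ex.input), ex.expected),
    0
  );
  return total / examples.length; // the number that goes up or down
}
```

Swap in your real feature call and a real example set and this is already a working eval.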

The reason it matters so much is that LLM behavior is non-monotonic. A change that makes one case better often makes another case worse. Without an eval, you are only looking at the cases you happened to think of, and you will not catch the regressions on the cases you did not think of until a customer reports them. With an eval, the regression shows up immediately, on a test case you already had, before you ship.

The minimum viable eval

The minimum eval that works is smaller than founders expect. I typically install one in the first few days of a new engagement. Here is the shape.

30–50 real examples. Pulled from actual user data (anonymized if needed), not made up. The examples should cover the real distribution of things users do with the feature, including the weird cases and the boring cases.

An expected output or a grading rubric for each example. Either a "correct answer" the feature should produce, or a list of criteria the output should meet. Grade by hand or use an LLM-as-judge — both are fine at this scale.

A script that runs the feature against every example and spits out a score. 30 minutes of engineering time. Should be runnable from the command line as npm run eval.

A commit-to-score log. Every time the eval runs, record the commit hash and the score. This is the history that tells you which changes helped.
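
The commit-to-score log can be one appended line per run. A sketch, assuming the script runs inside a git checkout; the scores.log filename is an illustrative choice, not a convention:

```typescript
import { execSync } from "node:child_process";
import { appendFileSync } from "node:fs";

// One log line per eval run: timestamp, commit hash, score.
function formatScoreLine(commit: string, score: number, when = new Date()): string {
  return `${when.toISOString()} ${commit} ${score.toFixed(3)}`;
}

// Append a line to the log. Assumes we are inside a git checkout, so
// `git rev-parse` can resolve the current commit.
function logScore(score: number, logPath = "scores.log"): void {
  const commit = execSync("git rev-parse --short HEAD").toString().trim();
  appendFileSync(logPath, formatScoreLine(commit, score) + "\n");
}
```

A flat text file is enough; the point is that every score is tied to a commit you can check out and rerun.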

That is the whole thing. You can build it in less than a week for almost any feature. The team that tells you they do not have time for an eval is telling you they do not understand how much time the lack of an eval is costing them.

The scoring approaches

For any given eval example, you need a way to decide whether the output is good. Four approaches, in increasing order of sophistication:

Exact match. Did the output exactly match the expected? Only works for structured outputs (extraction, classification, code generation with deterministic correctness).

Semantic match. Does the output mean the same thing as the expected answer, ignoring word choice? Use a string-similarity metric or an embedding-based similarity score. Works well for factual answers.

LLM-as-judge. Give a model the input, the output, and a rubric, and ask it to score the output. Works surprisingly well at scale and is the default for subjective criteria (tone, helpfulness, completeness). Budget for the fact that the judge is itself an LLM call and has its own failure modes.

Human grading. A human reads the output and scores it. The gold standard for subjective criteria, but does not scale past a few dozen examples per run. Useful as a calibration step for LLM-as-judge.

Most production evals use a mix — exact match where possible, LLM-as-judge for subjective dimensions, human spot-checks to calibrate the LLM-as-judge.
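
An LLM-as-judge grader is just a prompt containing the input, the output, and the rubric, plus a defensive parse of the reply. A sketch with an illustrative 1–5 scale; the actual model call is left out, and the prompt wording is an assumption, not a standard:

```typescript
type JudgeCase = { input: string; output: string; rubric: string };

// Build the prompt the judge model sees. Keeping this a pure function
// makes it easy to test and to version alongside the eval set.
function buildJudgePrompt(c: JudgeCase): string {
  return [
    "You are grading the output of an AI feature.",
    `User input:\n${c.input}`,
    `Feature output:\n${c.output}`,
    `Rubric:\n${c.rubric}`,
    "Reply with a single integer score from 1 (fails the rubric) to 5 (fully meets it).",
  ].join("\n\n");
}

// Parse the judge's reply defensively: the judge is itself an LLM call
// and sometimes wraps the score in extra text.
function parseJudgeScore(reply: string): number | null {
  const match = reply.match(/[1-5]/);
  return match ? Number(match[0]) : null;
}
```

Returning null instead of guessing lets you count unparseable judge replies as their own failure mode.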

The dimensions that matter

A single overall score is a starting point but not enough. Break the score into dimensions that match what you care about. For most features, the dimensions are some subset of:

Correctness. Did the output say the right thing? Factually accurate, no hallucinations, citations are real.

Completeness. Did the output cover all the parts of the question, or did it miss something?

Format. Is the output in the expected shape (JSON structure, length bounds, required fields present)?

Tone and style. Does the output match the voice and formality the feature should have?

Safety. Did the output avoid the things it should avoid (PII leaks, off-topic responses, unsafe recommendations)?

Latency and cost. How fast and how much? Not quality dimensions per se, but worth tracking alongside quality because the trade-offs matter.

Track each dimension separately. A feature that improved on correctness but regressed on latency is not obviously better — the team should see both numbers and decide.
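
Tracking dimensions separately amounts to a record of named scores per run. A minimal sketch using the dimension names from the list above:

```typescript
// Per-dimension scores for one eval run.
type DimensionScores = {
  correctness: number;
  completeness: number;
  format: number;
  tone: number;
  safety: number;
};

// Compare two runs dimension by dimension. A mixed result (some up,
// some down) is a decision for the team, not an automatic ship.
function diffRuns(before: DimensionScores, after: DimensionScores): Record<string, number> {
  const diff: Record<string, number> = {};
  for (const key of Object.keys(before) as (keyof DimensionScores)[]) {
    diff[key] = Number((after[key] - before[key]).toFixed(3));
  }
  return diff;
}
```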

Where evals come from

Building the eval set is the hardest part and the most often skipped. The three sources I pull from:

Real user data. Scrub and anonymize recent real requests. This gives you the actual distribution of what users do, which is the only distribution that matters.

Hand-curated edge cases. The things you think might break the feature. The weird inputs. The ambiguous queries. The adversarial inputs if security is in scope.

Regression bugs. Every time a user reports a bug, add the input to the eval with the correct expected output. This ensures the bug never comes back silently.

The eval grows over time. It is not a one-and-done artifact. An eval that was 30 examples in week 1 should be 200 examples in month 6 and 1000 examples by the end of year 1. The growth is gradual and driven by real events.

The ritual

An eval is only useful if you run it. The minimum ritual:

  • The eval runs automatically in CI on every PR that touches the AI feature.
  • The PR cannot merge if the score drops more than a set threshold below the last good commit's score.
  • On every significant change (new model, new prompt, new pipeline stage), the eval runs and the result is reported in the PR.
  • Once a week, somebody looks at the eval score trend across the whole feature and asks whether it is going in the right direction.

This is not heavy. It is the same discipline you would have for a normal test suite, applied to LLM quality.
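
The merge gate in the ritual above is a one-line comparison. A sketch, with the 0.02 threshold as an illustrative default rather than a recommendation:

```typescript
// Gate a PR on the eval score: pass only if the score has not dropped
// more than `threshold` below the last good commit's score.
function passesGate(lastGood: number, current: number, threshold = 0.02): boolean {
  return current >= lastGood - threshold;
}
```

In CI, exit non-zero when this returns false and the merge blocks automatically.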

The things teams usually get wrong

The eval is too small. A 5-example eval is better than nothing, but it will miss most regressions. Aim for 30 minimum, 100+ if the feature matters.

The examples are cherry-picked to be cases the feature already handles well. If your eval only contains cases the feature already passes, the eval cannot detect improvement. Include hard cases on purpose.

LLM-as-judge with an unclear rubric. Vague rubrics produce unreliable grades. Write rubrics that are specific enough that two humans would grade the same output the same way.

No baseline. Without a baseline score, you cannot tell whether a change was an improvement. Always grade the current version before making changes.

Eval drift. The team updates the eval set whenever they change the feature, so the score is not comparable across time. Keep a frozen baseline set that never changes, and add new cases to a separate growing set.
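
The frozen-baseline fix can be encoded in the eval set's structure itself. A sketch, with `frozen` and `growing` as hypothetical pool names:

```typescript
type Case = { input: string; expected: string };

// Two pools: a frozen baseline whose score is comparable across time,
// and a growing set where new regressions and edge cases land.
type EvalSet = { frozen: Case[]; growing: Case[] };

// New cases always go to the growing pool; the frozen pool never changes.
function addCase(set: EvalSet, c: Case): EvalSet {
  return { frozen: set.frozen, growing: [...set.growing, c] };
}
```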

Counterpoint: the eval is not the product

A warning. I have seen teams fall in love with their eval to the point where they optimize for the score at the expense of what users actually care about. This is Goodhart's Law in action — the metric becomes the target and stops measuring the thing. The eval is a tool, not a replacement for talking to users. Watch the eval, watch the user feedback, and be suspicious when they diverge.

The non-obvious benefit

Beyond catching regressions, the eval suite changes how the team talks about changes. Instead of "I think this prompt is better" — a statement that cannot be confirmed or refuted — the team talks in numbers. "This change moved the eval from 0.82 to 0.86 overall, with a drop on the formatting dimension we should investigate." That conversation is faster, clearer, and more productive than the alternative. The eval is not just a test — it is the shared language the team uses to reason about quality.

Your next step

This week, spend two hours building a 30-example eval for your highest-stakes AI feature. Pull the examples from real data. Write a simple script that runs the feature against each one. Grade the outputs yourself. You will learn more about your feature in those two hours than in the last month of prompt tweaks.

Where I come in

Installing evals is one of the first things I do at any AI-heavy engagement. 3–5 days to build the first version, a clear ritual around it, and a team that shifts from vibes-based decisions to measurement-based ones. Book a call if your AI feature is drifting and you are not sure why.


Related reading: Your AI Feature Is a RAG Pipeline · Model Selection Guide · Case Study: 72% LLM Cost Cut

Want help building a real eval suite? Book a call.
