
Definition of Done in an AI-Assisted Codebase: What Actually Counts as 'Shipped'

Updated Definition of Done for the agent era. AI-generated code review, test coverage, security pass, spec match. What actually matters in 2026.

Craig Hoffmeyer · 9 min read


Last week, I reviewed a PR that an engineer had marked "ready for deploy." The code was generated by Cursor (an AI coding assistant), passed tests, and had no linter errors. The human engineer's job was to run the AI, approve the output, and ship it.

I blocked it.

Not because the code was bad. But because the engineer and I had different ideas about what "done" means in an AI-assisted codebase.

For human-written code, a Definition of Done (DoD) checklist is straightforward: unit tests, integration tests, code review, linting, docs. But AI changes the game. The generated code might be syntactically perfect and still be wrong in subtle ways. Tests can pass and miss edge cases. Code review becomes AI audit, not human code reading.

And hardly anyone is talking about what "done" should mean now.

This is the post I needed two months ago when I realized my team and I were marking different things as shipped.

Why Definition of Done Matters Even More With AI

A Definition of Done is a shared understanding of what "shipped" means. Before AI, it was relatively simple: tests pass, code review approval, no linter errors, documentation updated.

With AI-generated code, the stakes are different.

AI-generated code requires different scrutiny. An AI wrote it, possibly without the context you have. It might be cleverly wrong. It might misinterpret the spec. It might introduce a subtle security bug that tests don't catch. Human review isn't about style anymore; it's about correctness and safety.

Tests need to be more comprehensive. If you're leaning on AI to generate the code, you can't lean on spot-checks in code review to catch bugs. Your test suite has to be comprehensive enough that a well-tested piece of AI code is actually trustworthy. Partial coverage isn't good enough anymore.

Spec matching becomes critical. AI can hallucinate. It might generate a feature that's close to the spec but not exactly right. The reviewer has to verify: "Does this actually do what was asked?" This is exhausting if specs are vague.

Security becomes a distinct gate. AI is good at generating code. AI is bad at thinking about threat models. If your code touches auth, payments, or PII, a human has to audit for security explicitly. "Tests pass" doesn't mean "secure."

Documentation is now mandatory. If the code was AI-generated, there's a higher chance the human who touches it next won't understand why it was written that way. Docs aren't optional; they're the bridge between AI output and human understanding.

The second-order effect: you're shipping faster, so the bar for "done" has to be higher, not lower. You can't afford to ship partially-baked code because your cycle time is now measured in hours.

The Updated Definition of Done for AI-Assisted Code

Here's what I'm using in production teams right now:

1. Spec match (human verification). Does it do what the spec says, exactly? Not "close." Not "most of it." All of it. If the spec is vague, clarify it before marking done. This is non-negotiable with AI-generated code.

2. Test coverage: ≥ 85% minimum for AI-generated code, ≥ 70% for human-written code. AI code needs higher coverage because you can't rely on clever reviewers to spot gaps. Critical paths (auth, payment, data deletion) always need 85%+, human or AI. (A CI-enforced version of these thresholds is sketched after this list.)

3. Edge case audit. Spend 30 minutes asking: "What could go wrong?" Input validation, boundary conditions, state transitions, concurrent access. AI often misses edge cases. This is not exhaustive testing; it's a deliberate search for failure modes.

4. Security audit (if code touches sensitive data). Does it handle PII safely? SQL injection risks? Race conditions? Privilege escalation? If you can't verify this in 20 minutes, get a security expert involved before shipping.

5. Code review by human familiar with the problem domain. Not just "does this pass tests," but "does this make sense?" The reviewer should be someone who understands why the feature matters and can spot if the implementation misses the intent.

6. Linting and type safety. ESLint, TypeScript strict mode, whatever. This is table stakes and non-negotiable. AI-generated code should be held to the same standard as human code.

7. Documentation of non-obvious choices. If the AI chose an algorithm, data structure, or approach that's clever but not obvious, document it. Future humans (including future you) will thank you. This is especially important for AI code because the reasoning is often not transparent.

8. Rollout strategy. If you're shipping to production, how are you rolling it out? Feature flag? Canary? Full deploy? For AI-generated code, even if it's low-risk, I prefer a controlled rollout for the first release. If something's weird, you find out fast without impacting all users. (A minimal rollout gate is sketched after this list.)

9. Observability in place. Metrics, logs, error handling. If this code breaks in production, can you see it? Can you roll it back? This is especially important for high-stakes code (payments, auth, data writes). (A minimal wrapper is sketched after this list.)

10. Async sign-off. The human who generated the code should not be the only reviewer. Someone else should look at it and agree it's done. This is a forcing function that prevents teams from shipping garbage because they're tired.
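
To make item 2 enforceable rather than aspirational, encode the thresholds in CI. Here's a minimal sketch as a Jest config, assuming Jest is your test runner and that hypothetical src/payments and src/auth directories hold your critical paths:

// jest.config.ts
import type { Config } from 'jest';

const config: Config = {
  collectCoverage: true,
  coverageThreshold: {
    // Baseline for everything, human- or AI-written
    global: { lines: 70, branches: 70 },
    // Critical paths always need 85%+, regardless of who wrote the code
    './src/payments/': { lines: 85, branches: 85 },
    './src/auth/': { lines: 85, branches: 85 },
  },
};

export default config;

The build fails when any threshold is unmet, so slipping coverage becomes a blocked merge instead of a retro item.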
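For item 8, the simplest controlled rollout is a deterministic percentage gate. A sketch, assuming stable user IDs and an existing async request handler; inRollout, legacyRefund, and the flag name are all hypothetical:

import { createHash } from 'crypto';

// Deterministically buckets a user into [0, 100) for a given feature,
// so the same user always sees the same variant.
function inRollout(userId: string, feature: string, percent: number): boolean {
  const hash = createHash('sha256').update(`${feature}:${userId}`).digest();
  return hash.readUInt32BE(0) % 100 < percent;
}

// First release of the AI-generated handler: 5% of users, widen as metrics stay clean.
if (inRollout(user.id, 'new-refund-handler', 5)) {
  await refund(transactionId, amount);
} else {
  await legacyRefund(transactionId, amount);
}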
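And for item 9, observability can start as a wrapper that guarantees every outcome emits a signal. A sketch assuming a statsd-style metrics client; the metrics object here is hypothetical:

// Wraps the refund handler so success, failure, and latency are always visible.
async function observedRefund(transactionId: string, amount: number) {
  const start = Date.now();
  try {
    const result = await refund(transactionId, amount);
    metrics.increment('refund.success');
    return result;
  } catch (err) {
    metrics.increment('refund.failure');
    console.error('refund failed', { transactionId, amount, error: err });
    throw err;
  } finally {
    metrics.timing('refund.latency_ms', Date.now() - start);
  }
}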

Notice what I don't require: "Human wrote every line." The code source doesn't matter. What matters is whether it's correct, safe, and maintainable.

A Concrete Example: The Payment Code That Almost Shipped Wrong

Three weeks ago, an engineer used Claude to generate a refund handler. It looked correct. Tests passed (92% coverage). No linter errors. The engineer marked it ready to deploy.

The code was:

async function refund(transactionId: string, amount: number) {
  const txn = await db.transactions.findOne({ id: transactionId });
  const refunded = await stripe.refunds.create({ charge: txn.stripeChargeId, amount });
  await db.transactions.update({ id: transactionId }, { status: 'refunded' });
  return refunded;
}

At first glance: looks good. Tests pass. So what's the problem?

I asked the engineer: "What happens if the Stripe refund succeeds but the database update fails? The money's refunded, but our system thinks it wasn't."

Pause. They hadn't considered it.

The spec didn't explicitly say "be transactional," but refunds are money. This needs to be transactional or you're creating a data consistency bug that's hard to debug in production.

The fix:

async function refund(transactionId: string, amount: number) {
  const txn = await db.transactions.findOne({ id: transactionId });

  if (!txn) throw new NotFoundError(`Transaction ${transactionId} not found`);
  if (txn.status === 'refunded') throw new AlreadyRefundedError(`Transaction ${transactionId} already refunded`);
  if (amount > txn.amount) throw new ValidationError(`Cannot refund more than transaction amount`);

  // If this call throws, nothing has moved yet; safe to let the error propagate.
  const refunded = await stripe.refunds.create({ charge: txn.stripeChargeId, amount });

  try {
    await db.transactions.update({ id: transactionId }, { status: 'refunded', refundedAt: new Date() });
  } catch (err) {
    // Stripe succeeded but the DB update failed: the money moved, our records didn't.
    // Log enough context for ops to reconcile, then fail loudly.
    console.error('Refund consistency error:', { transactionId, stripeRefundId: refunded.id, error: err });
    throw new InternalError('Refund completed in Stripe but DB update failed. Manual intervention required.');
  }

  return refunded;
}

This is what human review catches. Not style. Not "make it shorter." But "you're about to cause a data consistency bug."

If the Definition of Done for this code had included "Edge case audit" and "Async sign-off by domain expert," this would have been caught before deployment.
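
The edge case audit (item 3) pays off most when its findings get pinned as tests, so the failure modes can't quietly regress. A sketch in Jest against the handler above; stripeMock and dbMock are hypothetical test doubles, and their wiring into the handler is omitted:

// Assumes dbMock is seeded with the relevant transactions per test (setup omitted).
describe('refund edge cases', () => {
  it('rejects unknown transactions', async () => {
    await expect(refund('missing-id', 100)).rejects.toThrow(NotFoundError);
  });

  it('rejects double refunds', async () => {
    await expect(refund('refunded-id', 100)).rejects.toThrow(AlreadyRefundedError);
  });

  it('rejects refunds larger than the original amount', async () => {
    await expect(refund('txn-for-50', 100)).rejects.toThrow(ValidationError);
  });

  it('fails loudly when Stripe succeeds but the DB update fails', async () => {
    stripeMock.refunds.create.mockResolvedValueOnce({ id: 're_123' });
    dbMock.transactions.update.mockRejectedValueOnce(new Error('db down'));
    await expect(refund('txn-ok', 50)).rejects.toThrow(InternalError);
  });
});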

The Counterpoint: DoD Can Become a Bottleneck

If your DoD is too strict, you stop shipping. I've seen teams where every PR requires a full security audit, performance profiling, and a sign-off from three engineers.

At that point, DoD is not a quality gate. It's a bureaucratic drag.

The balance: DoD should be proportional to risk.

  • Low-risk changes (docs, copy, non-critical UI): Minimal DoD. Tests or review, that's it.
  • Medium-risk changes (features, API changes, non-financial logic): Full DoD. Tests, code review, spec match.
  • High-risk changes (payments, auth, data deletion, security): Rigorous DoD. Tests, code review, security audit, spec match, async sign-off.

This is where teams get it wrong. They apply the same DoD to everything, which either means they ship garbage (loose DoD) or they ship slowly (strict DoD).

The rule: DoD complexity should match risk, not stay constant.
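
To keep the tiers from living as tribal knowledge, encode them where tooling can see them, say, to drive a PR template or a merge-gate bot. A minimal sketch; the gate names map to the numbered checklist above:

type Risk = 'low' | 'medium' | 'high';
type Gate =
  | 'tests' | 'lint' | 'review' | 'spec-match'
  | 'security-audit' | 'edge-case-audit' | 'async-signoff' | 'rollout-plan';

// Each tier adds gates on top of the previous one.
const dodByRisk: Record<Risk, Gate[]> = {
  low: ['tests', 'lint'],
  medium: ['tests', 'lint', 'review', 'spec-match'],
  high: [
    'tests', 'lint', 'review', 'spec-match',
    'security-audit', 'edge-case-audit', 'async-signoff', 'rollout-plan',
  ],
};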

Where I Come In

I've built Definition of Done frameworks for a dozen teams transitioning to AI-assisted development. Each one has found the right balance between speed and safety. I know where the gaps are (security for payment code, async sign-off, rollout strategy) and where teams over-engineer (piling extra lint gates onto AI output that's usually already lint-clean).

If your team is shipping AI-assisted code and you're not sure if your DoD is appropriate, or if you're shipping too fast and want to tighten without creating bottlenecks, I can audit your process and build a risk-based DoD framework.

Let's talk about it.

Action Checklist

  • Write down your current Definition of Done. What's the checklist before something ships?
  • Mark each item as "Always required," "Required for high-risk code," or "Optional."
  • Look at your last 10 PRs that were AI-assisted. How many would have failed a rigorous spec-match check?
  • For each PR, identify the risk category (low, medium, high). Did the DoD complexity match the risk?
  • Pick one medium-risk feature shipped by AI. Do a post-mortem: What could have gone wrong? Would your current DoD have caught it?
  • Update your DoD to be risk-proportional. Add explicit gates for high-risk code.
  • Add "edge case audit" and "async sign-off" to your DoD for any code touching money, auth, or data deletion.

Related Reading

  • Reviewing AI-Generated PRs — How to review code that was written by an AI assistant. What to look for. What you can safely skip.
  • Coding Agents in Production — A framework for using AI to write production code. When it's safe. When it's risky. How to scale it without losing control.
  • DORA Metrics for Startups — How to measure delivery velocity without sacrificing quality. Change failure rate becomes more important when code generation is in the mix.

Let's audit your delivery standards

Get in touch →