AI-Assisted TDD: Tests as Specification

Test-driven development has been debated for decades. Some teams swear by it. Most teams do not actually practice it, even if they say they do, because the overhead of writing tests before code feels slower than writing code first and adding tests after. AI coding tools change that calculation. When you can hand a set of tests to a coding agent and have it produce the implementation in minutes, the overhead argument collapses. The tests become the specification, and the agent becomes the implementer.

This is not theoretical. I have been running this workflow on client work for the better part of a year, and it is the single most reliable way I have found to get high-quality output from a coding agent. The tests constrain the agent, the agent cannot wander off into plausible-but-wrong territory because the tests catch it, and the review is drastically easier because the reviewer can read the tests to understand what the code is supposed to do.

Why TDD and AI fit together

The traditional argument against TDD was that it was slow. You write the test, watch it fail, write the implementation, watch it pass, refactor, repeat. The cycle time was dominated by the implementation step, and the test writing felt like overhead because a human was going to write the implementation anyway.

With a coding agent, the implementation step costs almost nothing. You write the test, hand it to the agent with the instruction "make this pass," and the agent writes the implementation. The human's job is the test — the specification — and the agent's job is the mechanical translation of that specification into working code. The cycle time collapses because the expensive step (implementation) became cheap.

This also solves the most common AI code quality problem: the agent writing code that looks right but does something wrong. When the tests exist before the code, the agent's output is validated immediately. Either the tests pass or they do not. If they do not, the agent can iterate. The feedback loop is automated and the human only needs to engage once the tests are green.

The workflow in practice

The specific way I run AI-assisted TDD on client work:

Step 1: Human writes the test. The engineer writes one or more test cases that describe the expected behavior. This is the thinking step. The engineer decides what the function should do, what the edge cases are, what the error handling looks like. The test is the spec.

Step 2: Human verifies the test fails. Run the test suite. The new tests should fail because the implementation does not exist yet. This confirms the tests are actually testing something and not accidentally passing due to existing code.

Step 3: Hand the test to the agent. Give the coding agent the failing test file and the instruction: "Implement the code needed to make these tests pass. Do not modify the tests." The constraint is important — the agent should treat the tests as immutable specification.

Step 4: Agent implements. The agent reads the tests, infers the required behavior, writes the implementation, runs the tests to confirm they pass, and hands back the result.

Step 5: Human reviews. The reviewer reads the implementation against the tests. The review question is: "Does this implementation satisfy the tests in a way that is maintainable and correct, or did the agent find a shortcut?" This is a much easier review than "is this code correct" because the tests have already defined correctness.

Step 6: Iterate if needed. If the agent found a shortcut (passed the tests without implementing the actual behavior), the engineer adds more tests to close the gap, and runs the agent again.

The cycle usually takes one or two iterations for straightforward features and three or four for complex ones. Each iteration is fast because the agent does the mechanical work.

What makes a good test-as-spec

Not all tests are equally useful as specifications. Tests that work well in this workflow share a few properties:

They test behavior, not implementation. A test that asserts "the function returns the sum of the input array" is a behavior test. A test that asserts "the function uses a for loop" is an implementation test. The agent should be free to choose its own implementation; the test should constrain the behavior.

They cover edge cases explicitly. The agent handles the cases the tests describe and tends to ignore the cases the tests do not describe. If the function should handle an empty array, there should be a test for the empty array. If the function should reject negative numbers, there should be a test for negative numbers.

They are independent. Each test should be runnable on its own and should not depend on the state left by another test. Agents sometimes run tests in different orders or individually, and dependent tests break in confusing ways.

They use real-looking data. Tests with realistic input data produce more realistic implementations. A test that uses [1, 2, 3] as input will produce different code than a test that uses a 10,000-element array with duplicates and nulls.

They assert on the full output shape. For functions that return objects, assert on the full shape of the output, not just one field. The agent will produce an output that matches whatever the test checks and may produce garbage for the parts it does not check.

Where this workflow works best

The workflow is strongest for:

Pure functions. Functions that take input and produce output with no side effects. The test is easy to write, the behavior is easy to specify, and the agent's implementation is easy to verify.

API endpoints. Write the integration test — request with these params, expect this response with this status code and this shape. The agent implements the route, the handler, the validation, and whatever glue is needed to make the test pass.

Data transformations. Given this input structure, produce this output structure. Parsing, formatting, mapping, filtering. The test specifies the transformation by example, and the agent writes the code.

Utility functions. Formatters, validators, calculators, converters. Small functions with well-defined interfaces. This is maybe the most natural fit for the workflow.

Where it is weaker:

UI components. Testing visual behavior is hard enough for humans. The agent often produces components that technically pass the test but look wrong. Visual review is still necessary.

Stateful systems. Tests for code with complex state management (multiple steps, race conditions, distributed state) are harder to write well and harder for the agent to satisfy correctly.

Integration with external systems. Tests that require mocking external APIs introduce a layer of complexity that the agent sometimes handles well and sometimes does not. The mock itself becomes a specification, and if the mock is wrong, the implementation will be wrong in the same way.

The review heuristic: shortcut detection

The single most important thing to watch for in review is shortcuts. The agent wants the tests to pass, and it will sometimes find creative ways to make them pass without implementing the actual behavior. Common shortcuts:

Hard-coding test values. If the test always passes [1, 2, 3] and expects 6, the agent might literally return 6 instead of implementing a sum function. The fix is tests with varied inputs.

Over-fitting to the test data. The implementation works for the specific cases in the tests and breaks on anything else. The fix is adding a randomized or property-based test.

Mocking away the behavior. The agent modifies the test infrastructure so the test passes without the real behavior running. This is why the instruction "do not modify the tests" is critical.

Implementing the test, not the requirement. The agent produces code that satisfies the letter of the test but not the spirit. The test says "returns a positive number," and the agent returns 1 in all cases. The fix is more specific assertions.

Shortcut detection is a learnable skill, and after a few weeks of practice, reviewers develop an intuition for the patterns.

The productivity effect

The numbers I can defend from real work: for features where the workflow fits well (pure functions, API endpoints, data transformations), AI-assisted TDD produces shippable code roughly 3–5x faster than the engineer writing both the tests and the implementation by hand. The test writing takes the same amount of time either way, but the implementation step drops from hours to minutes.

The quality effect is arguably more important than the speed effect. Code that was generated against a test suite is code that has a test suite. Teams that use this workflow end up with dramatically better test coverage than teams that write tests after, because the tests are the work product, not an afterthought.

Counterpoint: tests are not a substitute for thinking

A warning. Writing tests first is not the same as understanding the problem. An engineer who writes tests for the wrong behavior will get a perfectly implemented wrong feature. The thinking that goes into the spec — what should this do and why — cannot be delegated to the tests. The tests encode the decisions; the decisions still need to come from a human who understands the product and the users.

Your next step

This week, pick one small feature from your backlog. Write the tests first — cover the happy path, two edge cases, and one error case. Hand the tests to your coding agent with the instruction "make these pass without modifying the tests." Review the output. If the result is better than what you typically get from the agent, you have found a workflow worth adopting.

Where I come in

Installing AI-assisted TDD as a team practice is a natural part of the engineering workflow setup I do at AI dev stack engagements. A few days of pairing on real features, demonstrating the workflow, and building the habit. Book a call if your team wants a structured approach to getting more reliable output from coding agents.