AI-Assisted Software Development

Coding Agents in Production: Where They Work and Where They Wreck You

An honest operational guide to running coding agents on real codebases — the tasks they handle well, the traps they fall into, and the guardrails that keep them from causing damage.

Craig Hoffmeyer · 8 min read

A coding agent is different from a coding assistant. A coding assistant suggests the next line and hands control back to you. A coding agent takes a task, reads files, makes decisions, edits code, runs commands, and comes back when it is done or stuck. The difference in control model is the difference between a tool and a delegate, and the operational implications are not the same.

I have been running coding agents on client work for most of the last two years. The honest take is that in 2026, they are unavoidable for any team serious about velocity, and they are also still unreliable enough that running them without guardrails will hurt you. This article is the operational version of "how do you actually do this well" — the tasks I delegate to agents, the ones I keep for myself, and the specific guardrails I insist on before letting an agent touch production-adjacent code.

What a coding agent actually is

For this article, a coding agent is a tool that:

  • Takes a task described in natural language.
  • Has access to a file system, a terminal, and usually some additional tools (web search, test runners, linters).
  • Plans and executes multi-step work autonomously.
  • Reports back with a summary of what it did.

Claude Code is the tool I use most often in this category. Cursor's agent mode, the various background agent products, and Copilot's agent modes all fit the same description. The specific tool matters less than the pattern: delegate a task and review the result.

The tasks coding agents handle well

Not all tasks are equal. Some are great for agents and some are terrible. The ones that work reliably, in my experience:

Well-defined refactors. "Rename this function everywhere it is used and update the tests." "Move all the API routes from the pages directory to the app directory." "Change this type from X to Y and fix all the callsites." Mechanical, scope-defined work where the success criterion is obvious.

Test writing. Given an existing function, write the tests. Given a new feature, write the tests. Agents are remarkably good at this, and the tests they write are usually better than the tests teams write when they are rushing. This is maybe the single highest-ROI use of agents.

Boilerplate and scaffolding. Setting up a new module, adding a new API endpoint that follows the existing pattern, creating a new page that matches the existing page structure. Work that is tedious for a human precisely because it is pattern-matching is fast for an agent.

Dependency updates. "Bump this package to the latest version and fix any breaking changes." Agents are surprisingly good at reading changelogs, spotting breaking changes, and making the required adjustments.

Small, well-scoped bug fixes. "This test is failing with this error. Figure out why and fix it." When the bug is localized and the symptoms are clear, agents often find and fix it faster than a human would have.

Codebase exploration. "Find everywhere the billing logic touches the user model and summarize it." Not code generation at all — just search and summary — but this is one of the uses I reach for most often and it saves enormous amounts of time.

Documentation generation. "Read this module and write a README that explains what it does and how to use it." Boring, time-consuming for humans, fast for agents.

The tasks coding agents do not handle well

And the inverse. Tasks where I have tried repeatedly and the results are either unreliable or actively dangerous:

Architectural decisions. Agents will happily propose architectures, and the proposals sound plausible, but they miss the strategic context (where the company is going, what constraints the customer has, what the team can maintain). Architecture is a thinking job, not a code-writing job.

Complex debugging across systems. When the bug involves understanding interactions between three services, a message queue, and a third-party API, the agent usually cannot hold enough context to reason about it correctly. It will propose local fixes that do not address the real problem.

Performance optimization. Agents can sometimes spot obvious problems (missing indexes, quadratic loops), but they are bad at the kind of optimization that requires profiling, measuring, and thinking about hot paths. They tend to optimize the wrong thing.

Security-sensitive code. I do not let an agent write auth code, crypto code, or anything touching secrets without detailed human review. The failure modes are expensive, the AI sometimes produces confident-sounding wrong answers, and the review effort usually outweighs the time saved by generation.

Anything where the requirements are fuzzy. If I cannot articulate the task precisely, neither can the agent. Fuzzy requirements lead to fuzzy output. When the requirements are unclear, the work that needs doing is clarifying them, not writing code.

Work that requires reading code from outside the current repo. Agents often need to understand how a third-party library behaves. They guess based on documentation they were trained on or can search, and the guesses are sometimes wrong.

The guardrails I insist on

Before I let an agent work on a client's real repository, I install specific guardrails. These are not optional.

1. Branch protection. The agent works on a branch. It cannot push to main. It cannot merge its own PRs. A human reviews and merges.

2. No destructive commands without approval. The agent has access to the terminal, but destructive commands (rm, git reset --hard, DROP TABLE) either require explicit confirmation or are blocked entirely. This prevents accidents.
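The deny-list idea can be sketched in a few lines. The patterns and the function below are illustrative, not any particular agent tool's configuration format:

```python
import re

# Commands that should never run without explicit human approval.
# Patterns are a starting point; extend them for your own environment.
DENY_PATTERNS = [
    r"\brm\s+-rf?\b",             # recursive/force deletes
    r"\bgit\s+reset\s+--hard\b",  # discards local work
    r"\bgit\s+push\s+--force\b",  # rewrites remote history
    r"\bDROP\s+TABLE\b",          # destructive SQL
]

def requires_approval(command: str) -> bool:
    """Return True if the command matches a destructive pattern."""
    return any(re.search(p, command, re.IGNORECASE) for p in DENY_PATTERNS)
```

In practice you would wire a check like this into whatever pre-execution hook your agent tool exposes, and block or prompt when it returns True.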

3. Scoped file access. The agent has access to the project directory, not to the whole filesystem. It cannot read files outside the project. It cannot read the user's SSH keys or credentials. This is usually a default in agent tools; make sure you have not disabled it.
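The core of scoped file access is a containment check. This sketch assumes a hypothetical project root; resolving the path first collapses `..` segments and symlinks, which is what defeats path-traversal tricks:

```python
from pathlib import Path

# Hypothetical project root for illustration.
PROJECT_ROOT = Path("/home/dev/project").resolve()

def is_within_project(path: str) -> bool:
    """Resolve '..' segments and symlinks, then check containment."""
    resolved = (PROJECT_ROOT / path).resolve()
    return resolved.is_relative_to(PROJECT_ROOT)
```

A request for `../../etc/passwd` or an absolute path outside the root fails the check even though it was joined onto the project directory.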

4. Secrets are not in the agent's context. Production credentials, API keys, customer data — none of these should be in the agent's working directory or accessible to it. Secrets go in the secret manager and are only loaded at runtime in the real environment.

5. Tests run before the human reviews. The agent runs the test suite before handing back the PR. If the tests do not pass, the agent either fixes them or reports the failure. A PR with failing tests is a waste of the human reviewer's time.
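The gate itself is trivial to automate. This sketch assumes `pytest` as the runner; substitute whatever your project uses:

```python
import subprocess

def tests_pass(test_command=("pytest", "-q")) -> bool:
    """Run the test suite; a nonzero exit code means failure.

    The default command is an assumption -- swap in your own runner.
    """
    result = subprocess.run(list(test_command), capture_output=True, text=True)
    return result.returncode == 0
```

The agent harness calls this before opening the PR and either loops back to fix failures or reports them in the hand-off summary.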

6. Clear task scoping. The task given to the agent should have clear boundaries: what files to change, what files not to change, what success looks like. Vague tasks produce vague output.

7. A time or cost cap. Agents can burn through a lot of compute on a hard task. Set a budget — a wall-clock cap, a token cap, or a cost cap — and fail the task if the agent exceeds it. Runaway agent costs are a surprisingly common support ticket.
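A budget can be a small object the agent harness consults between steps. The limits below are illustrative defaults, not recommendations:

```python
import time

class AgentBudget:
    """Fail the task when wall-clock or spend limits are exceeded."""

    def __init__(self, max_seconds: float = 1200.0, max_cost_usd: float = 5.0):
        self.start = time.monotonic()
        self.max_seconds = max_seconds
        self.max_cost_usd = max_cost_usd
        self.spent = 0.0

    def record(self, cost_usd: float) -> None:
        """Accumulate the cost of one model call."""
        self.spent += cost_usd

    def check(self) -> None:
        """Raise if either cap is exceeded; call between agent steps."""
        if time.monotonic() - self.start > self.max_seconds:
            raise TimeoutError("agent exceeded wall-clock budget")
        if self.spent > self.max_cost_usd:
            raise RuntimeError("agent exceeded cost budget")
```

Failing loudly here is the point: a task that blows its budget should surface as a failed task, not a surprise on the invoice.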

8. Human review before anything ships. Obvious, but worth stating. No agent output goes to production without a human review. See reviewing AI-generated PRs for the specific review heuristics.

The team practice

Beyond the technical guardrails, running coding agents in a team context requires practices that are not yet standard:

The agent session is the author. Track which PRs were agent-authored and which were human-authored. This data is useful later — it tells you where bugs are coming from and whether your review practice is catching the AI-specific failure modes.
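One lightweight way to track this is a commit trailer. The trailer name `Agent-Authored` is a hypothetical convention, not a standard; `git log --format='%B%x00'` produces the NUL-separated commit bodies the sketch parses:

```python
def count_by_author_type(log_text: str) -> dict:
    """Split NUL-separated commit bodies and count by a trailer convention.

    The "Agent-Authored: true" trailer is an assumed team convention.
    """
    commits = [c for c in log_text.split("\x00") if c.strip()]
    agent = sum("Agent-Authored: true" in c for c in commits)
    return {"agent": agent, "human": len(commits) - agent}
```

Run that over a month of history and you have the raw data for the bug-attribution question.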

One person is on point for the agent's work. Even if three engineers are using agents, each agent session should have one human owner who is responsible for the outcome. Shared ownership leads to no ownership.

Escalation protocol when the agent is stuck. Agents sometimes loop on a problem, making the same mistakes repeatedly. The engineer running the agent needs to know when to stop, rescope, and either take over or abandon. My rule of thumb: if the agent is still wrong on the third attempt, the task is wrong, not the agent.
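The three-attempt rule is easy to encode in an agent harness. Both callables here are placeholders for whatever your tooling actually provides:

```python
MAX_ATTEMPTS = 3  # rule of thumb: after the third failure, rescope the task

def run_with_escalation(run_attempt, is_acceptable):
    """Try a task up to MAX_ATTEMPTS times, then escalate to a human.

    run_attempt(n) and is_acceptable(result) are hypothetical hooks
    into your agent harness.
    """
    for attempt in range(1, MAX_ATTEMPTS + 1):
        result = run_attempt(attempt)
        if is_acceptable(result):
            return result
    raise RuntimeError("task needs rescoping: agent failed three attempts")
```

The exception is the escalation: the human owner decides whether to narrow the task, take it over, or drop it.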

Weekly calibration. The same weekly calibration ritual I recommend for AI PR review (see that article) applies here. Look at the agent's recent work, find the patterns in what went well and what did not, and update the team's internal guidelines.

The cost reality

Coding agents are cheap relative to human engineering time and expensive relative to inline completions. A typical task that takes an agent 5–20 minutes of wall-clock time might cost $0.50–$5.00 in model usage. Multiply that by the number of tasks a team runs in a day and you get real numbers.

The framing that matters is cost-per-task versus value-per-task. A task that saves a human engineer 45 minutes at $3 of agent cost is a strong deal. A task that saves a human engineer 5 minutes at $2 of agent cost is not.
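The arithmetic is worth making explicit, including the review time the agent's output still costs you. The hourly rate and review minutes below are assumed figures:

```python
def task_roi(minutes_saved: float, agent_cost_usd: float,
             review_minutes: float, rate_usd_per_hour: float = 120.0) -> float:
    """Net dollar value of delegating one task to an agent.

    The engineer rate and review time are assumptions, not benchmarks.
    """
    net_minutes = minutes_saved - review_minutes  # review time is not free
    return (net_minutes / 60.0) * rate_usd_per_hour - agent_cost_usd
```

At an assumed $120/hour, a 45-minute save clears a $3 agent cost even with 15 minutes of review, while a 5-minute save goes negative as soon as review time is counted.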

Be deliberate about which tasks you delegate. Delegating everything is expensive and unnecessary. Delegating the right things is transformative.

Where agents are going

Three things I expect to change in the next 12 months:

Reliability will improve. The tasks agents can handle will expand as the underlying models improve and the tool design matures. Tasks that are marginal today will be reliable in a year.

Background usage will become more common. Teams will queue tasks for agents to work on overnight and come back to a stack of reviewable PRs in the morning. This is already happening; it will become more normal.

The hand-off from agent to human will get clearer. Better reporting, better explanations, better "here is what I tried and where I got stuck" output. The current state is too terse; the next state will be more readable.

Counterpoint: the agent is not a substitute for a senior engineer

A warning. A coding agent multiplies a capable senior engineer. It does not replace one. Teams of juniors trying to ship production code via agents with no senior review produce codebases that degrade faster than they would have without the agents. The value of agents is leverage for people who already know what they are doing. Do not expect them to close the gap for people who do not.

Your next step

This week, pick one task on your backlog that fits the "works well for agents" list — a refactor, a test suite, a dependency bump, a code exploration. Run an agent on it. Review the output carefully. Write down what went well and what did not. That single concrete experiment will teach you more about where agents fit in your workflow than any amount of reading.

Where I come in

Installing the right agent workflow for a team — the guardrails, the task selection, the review practice — is a common 1–2 week engagement. I do it alongside the team, not as a lecture. By the end, the team is running agents on real work with the safety nets in place. Book a call if your team is experimenting with coding agents and wants an operational playbook.


Related reading: The 2026 AI-Native Dev Stack · Reviewing AI-Generated PRs · A Fractional CTO's Claude Code Playbook

Running coding agents on real work? Book a call.
