
Case Study: How We Cut an LLM Bill by 72% in Three Weeks

A real engagement (details anonymized) where a Series A startup was bleeding $38K/month on LLM APIs. The specific audit I ran, the changes we made, and the final bill.

Craig Hoffmeyer · 7 min read

A few quarters ago I got a call from a Series A founder I had met through a warm intro. The company was a vertical SaaS product with a heavy AI feature set — think document processing with agentic summarization and extraction. Things were going well. Revenue was growing. The investors were happy. One number was not happy: the LLM bill.

The conversation lasted 20 minutes. The punchline was that the monthly OpenAI and Anthropic combined bill had climbed from $4,800 to $38,200 in five months, was on track to cross $50,000 the following month, and nobody internally could explain why. The CFO was asking increasingly pointed questions. The founder wanted an honest outside read.

I took the engagement as a 3-week focused audit and optimization sprint. This article is what happened — the numbers, the findings, and the changes. Details are anonymized but all numbers are real and representative.

The starting point

Before touching anything, I spent two days collecting the data. Here is what the baseline looked like at the start of the engagement:

Monthly LLM spend: $38,200 (combined OpenAI + Anthropic).

Traffic: About 180,000 LLM calls per day across four features. No per-feature attribution existed — I had to build it.

Team: 4 engineers, all contributing to AI features, no dedicated AI owner. No evaluation infrastructure. No cost dashboards. Framework of choice was a popular agentic LLM framework that I will not name directly.

Features using LLMs:

  1. Document summarization — the original feature, shipped 10 months ago. Stable, moderate usage.
  2. Agent-based extraction — newer, launched 4 months ago, responsible for most of the cost growth.
  3. Chat interface — customer support chat, shipped 2 months ago, smaller volume.
  4. Background classification — an internal-facing feature that scored documents for triage.

I spent the first week tagging every LLM call with a feature, logging the input and output token counts, and building the cost attribution dashboard that should have existed from day one.
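The attribution layer itself does not need to be fancy. Here is a minimal sketch of the idea: wrap every call with a feature tag and token counts, and accumulate cost per feature. The model names and per-million-token prices below are hypothetical placeholders, not the client's actual rates.

```python
from collections import defaultdict

# Hypothetical (input, output) prices in USD per 1M tokens; real prices
# vary by model and provider, so load these from config in practice.
PRICES = {"frontier": (3.00, 15.00), "cheap": (0.25, 1.25)}

class CostLedger:
    """Accumulates LLM spend per feature from logged token counts."""

    def __init__(self):
        self.totals = defaultdict(float)  # feature name -> USD

    def record(self, feature, model, input_tokens, output_tokens):
        in_price, out_price = PRICES[model]
        cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
        self.totals[feature] += cost
        return cost

ledger = CostLedger()
ledger.record("extraction", "frontier", input_tokens=12_000, output_tokens=800)
ledger.record("chat", "cheap", input_tokens=2_000, output_tokens=300)
```

Once every call path routes through something like `record`, the per-feature dashboard is just a sum over the ledger, grouped by day.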

What I found

Once I had per-feature attribution, the story told itself. The monthly breakdown looked roughly like this:

  • Agent-based extraction: $24,100 (63%) — the runaway.
  • Document summarization: $7,400 (19%).
  • Background classification: $4,200 (11%).
  • Chat interface: $2,500 (7%).

The agent feature was the focus. Digging in, I found four compounding problems.

Problem 1: Uncapped agent loops. The framework the team was using had a default max-iteration count of 15. On roughly 8% of requests, the agent would use all 15 iterations before returning. Each iteration was a full model call. Most of those late iterations were producing degraded or duplicate output that nobody would look at.

Problem 2: Context re-sending on every iteration. Each iteration of the agent was sending the full accumulated context back to the model, including previous tool-call responses. By iteration 12, a single user request had sent over 140,000 input tokens total across iterations — for a task the team thought used "about 4,000 tokens."
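The reason this surprises teams is that re-sending the full history makes total input tokens grow as an arithmetic series, not linearly. A toy calculation shows the shape of the blow-up; the 4,000-token base and the per-iteration growth figure here are illustrative numbers chosen to match the magnitude above, not measured values.

```python
def total_input_tokens(base, per_iter, iterations):
    """Total input tokens across all iterations when the full accumulated
    context is re-sent every time (arithmetic-series growth)."""
    return sum(base + per_iter * i for i in range(iterations))

# Illustrative: a 4,000-token starting context, with each iteration
# appending ~1,400 tokens of tool output that gets re-sent thereafter.
print(total_input_tokens(4_000, 1_400, 12))  # → 140400
```

The per-iteration cost of the *last* iterations dominates, which is also why capping the loop count pays off twice: fewer calls, and the calls you skip are the most expensive ones.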

Problem 3: Wrong model tier for routine cases. The team was using a frontier model for every request. About 70% of the extraction cases were routine and could be handled by a much cheaper model at a fraction of the per-request cost. The 30% of cases where the cheaper model failed were the cases where the frontier model was genuinely earning its keep.

Problem 4: No caching of system prompts. The system prompt for the extraction agent was about 3,200 tokens long. Anthropic's prompt caching would have cut the cost of those tokens by 90%. The team had not enabled it.
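For reference, enabling Anthropic's prompt caching is a matter of marking the system prompt as a cacheable content block. The sketch below builds the request parameters; the model name is a placeholder, and in production this dict would be passed to the official SDK's `client.messages.create(...)`.

```python
SYSTEM_PROMPT = "You are an extraction agent..."  # imagine ~3,200 tokens here

def build_request(user_message, model="claude-sonnet-4-5"):
    """Build Messages API params with the system prompt marked cacheable."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": SYSTEM_PROMPT,
                # Marks this prefix as cacheable; identical prefixes on
                # subsequent requests are read from cache at reduced cost.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_message}],
    }
```

Because caching keys on an identical prefix, the system prompt has to be byte-stable across requests: any per-request value interpolated into it defeats the cache.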

The other features had smaller versions of similar problems. The document summarization feature was sending retrieval-augmented context that had grown from 6 chunks at launch to 18 chunks by the time I looked. The chat interface had its max output tokens set to 4,096 when most responses needed only about 300.

The changes

Across three weeks, the team and I made the following changes. I was mostly the designer and coach; the engineers did the actual implementation.

Week 1: Agent loop cap and routing. We hard-capped the agent iteration count at 6 (down from 15). We added a routing layer that tried the cheaper model first and escalated to the frontier model only on failure. We added telemetry for both so we could measure the quality impact.
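The escalation half of that change is a small piece of code. A minimal sketch, with the model calls and the validation check left as injectable callables (in the real system these were API calls and a schema check on the extraction output):

```python
def route(request, cheap_call, frontier_call, passes_check):
    """Try the cheaper model first; escalate to the frontier model only
    when the cheap result fails validation. Returns (result, tier) so the
    telemetry layer can measure how often escalation happens."""
    result = cheap_call(request)
    if passes_check(result):
        return result, "cheap"
    return frontier_call(request), "frontier"
```

With roughly 70% of requests handled by the cheap tier, the expected cost per request drops toward the cheap model's price, while the 30% that escalate pay for one extra (cheap) call on top of the frontier call they would have made anyway.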

Week 2: Prompt caching and context trimming. We enabled Anthropic prompt caching for the extraction agent's system prompt. We audited the system prompt itself and cut it from 3,200 tokens to 2,100 tokens by removing dead few-shot examples. We redesigned the iteration state so it only re-sent the context changes, not the full history, on each iteration.
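The state redesign amounted to replacing "append everything forever" with a compact state object. This is a hypothetical shape, not the framework's API, but it captures the idea: each iteration carries the task, a short running summary, and only the latest tool results.

```python
class IterationState:
    """Compact per-iteration agent state. Instead of re-sending the full
    transcript, each prompt is rebuilt from a bounded state object
    (hypothetical design, not any specific framework's API)."""

    def __init__(self, task):
        self.task = task
        self.scratchpad = ""            # short running summary, bounded size
        self.latest_tool_results = []   # replaced each iteration, never appended

    def advance(self, scratchpad, tool_results):
        self.scratchpad = scratchpad
        self.latest_tool_results = tool_results

    def build_prompt(self):
        return {
            "task": self.task,
            "scratchpad": self.scratchpad,
            "tool_results": self.latest_tool_results,
        }
```

The key property is that `build_prompt` output is roughly constant-size per iteration, so total input tokens grow linearly with iteration count instead of quadratically.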

Week 3: Output controls and feature-level caps. We set sensible max_tokens on every feature. We capped the number of retrieval chunks in the document summarization feature at 6. We added a per-feature daily cost alarm that would page the on-call engineer if any feature exceeded 2x its previous week's daily average.
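The alarm logic is deliberately dumb. A sketch of the check we ran per feature per day, against the ledger data from week 1:

```python
def should_page(today_cost, last_week_daily_costs, multiplier=2.0):
    """Page when a feature's spend for the day exceeds `multiplier` times
    its previous week's daily average."""
    baseline = sum(last_week_daily_costs) / len(last_week_daily_costs)
    return today_cost > multiplier * baseline
```

A 2x threshold against a trailing weekly average is coarse, but that is the point: it will never fire on normal growth, and it would have caught the agent feature's runaway months earlier.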

We also added a quality eval that ran the 50 most common extraction requests against the old configuration and the new configuration and scored them. The new configuration was within 2% of the old configuration on the quality metric the team cared about. That was the check that let the founder feel confident about the rollout.
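The eval harness reduces to a before/after comparison on a fixed request set. A minimal sketch, with the two configurations and the scoring function injected (in the engagement, the scorer was the team's own extraction-accuracy metric):

```python
def compare_configs(requests, run_old, run_new, score):
    """Score both configurations on the same request set and return the
    relative quality delta of the new configuration vs. the old one."""
    old = sum(score(r, run_old(r)) for r in requests) / len(requests)
    new = sum(score(r, run_new(r)) for r in requests) / len(requests)
    return (new - old) / old  # e.g. -0.02 means the new config scores 2% lower
```

Running this over the 50 most common requests is what turned "we think quality is fine" into a number the founder could sign off on.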

The result

The bill for the month after the changes was $10,700, down from $38,200. A 72% reduction. Traffic was actually up 15% over the same period because of continued customer growth, so the per-request cost reduction was even more dramatic.

The quality did not noticeably degrade. Customer complaints did not spike. The team later told me that the agent feature actually got faster: with iterations capped at 6, responses came back sooner even on requests that escalated to the slower frontier model.

The engagement lasted three weeks and cost the founder a small fraction of what the monthly savings represented. The payback was less than a week. Six months later, the team was still using the per-feature cost dashboard and the monthly cost review ritual, and the bill had stayed under $15,000 even as usage grew further.

The lessons that generalize

This engagement was specific, but the lessons are not. Every LLM cost audit I run hits some combination of the same five things.

Lesson 1: You cannot manage what you cannot measure. The single most valuable early output of the engagement was the per-feature cost attribution dashboard. The team had been flying blind for months. Once they could see where the money was going, the conversation changed immediately.

Lesson 2: Frameworks have defaults that cost money. The agent framework's default iteration count was set for quality, not for cost. Every framework you use is making choices on your behalf and some of those choices show up on your bill. Read the defaults.

Lesson 3: Prompt caching is the easiest win available. Enabling it on a long system prompt that is reused across requests is usually a 40–70% cost reduction on that portion of the bill. It is a single API flag. There is almost no reason not to use it.

Lesson 4: Model routing is the second easiest win. Most tasks do not need a frontier model. Running a cheaper model first and falling back to the expensive one on failure typically cuts cost 60–80% on high-volume features without any quality degradation.

Lesson 5: Controls before heroics. The changes that actually moved the bill were not clever. They were caps, routing, and caching. The team did not need a rewrite or a new architecture. They needed an outside perspective and a focused few weeks.

What I would do differently

Every engagement teaches me something. On this one, I should have pushed harder for a broader evaluation suite before rolling out the model routing change. The 50-request eval was enough to catch obvious quality regressions, but a week later the team found an edge case where the cheaper model was quietly failing on a specific document type. We had to add a routing rule to handle it. A more diverse eval set built up front would have caught it pre-rollout.

The meta-lesson: at LLM scale, evaluation infrastructure is not optional. It is the thing that lets you make cost changes without fear.

Counterpoint: not every bill is this juicy

A note of honesty. 72% is on the high end of what I see in cost audits. The typical range is 40–60%. The specific combination of problems in this case — uncapped agents, frontier-model-for-everything, no caching, prompt bloat — is unusually lucrative. A team that is already running a disciplined LLM practice will see much smaller savings, maybe 15–25%, which is still worth doing but does not make for dramatic case studies.

Your next step

Pick one AI feature on your product. Take 30 minutes this week to answer four questions about it: (1) What is the per-request cost? (2) Is the model the right tier for the task? (3) Is prompt caching enabled? (4) Is there a loop or retry that could be capped? Each question you cannot answer is the first output of your audit and a specific thing to fix this week.

Where I come in

LLM cost audits are one of the highest-ROI engagements I offer. Two to three weeks, a real cost dashboard, a prioritized list of changes, and a final number the founder can take to the CFO. Book a call if your LLM bill has been climbing and you want an honest independent read on where the money is going.


Related reading: The Hidden Cost Curve of LLM Features · Your Cloud Bill Is a Strategy Document · Build vs. Buy vs. Wrap: A Founder's Framework for AI Features

Think your LLM bill is too high? Book a call.
