AI Gateways: Why Your LLM Stack Needs a Router in 2026 (And How to Pick One)

Calling OpenAI directly from your app was fine in 2023. In 2026 it is a liability — cost, reliability, governance, and vendor lock-in all suffer. Here is the fractional CTO's playbook for adopting an AI gateway without over-engineering it.

Craig Hoffmeyer · 9 min read

A founder called me last week in a minor panic. Anthropic had a regional incident during his product's launch window, his app dropped every inference request for twenty-three minutes, and his Slack channel filled with support tickets and one very annoyed design partner. "I thought we were resilient," he said. They were resilient in the traditional sense — load balancers, read replicas, multi-AZ deploys. But they were calling api.anthropic.com directly from their application code, with a hardcoded model name and no failover. When that endpoint went sideways, so did they.

This is the state of a depressing number of AI-native startups in April 2026. The LLM call is treated as if it were any other HTTP request: point it at an SDK, ship it, move on. But an LLM call is not any other HTTP request. It is an expensive, high-variance, provider-specific operation with token-based pricing, streaming semantics, caching opportunities, and a failure distribution that looks nothing like a REST endpoint's. If you are running AI features in production without a gateway in front of your model calls, you are giving up reliability, cost savings, and governance.

What an AI gateway actually is

An AI gateway is a thin routing and policy layer that sits between your application and one or more model providers. Conceptually it is the same shape as an API gateway — ingress, auth, policy, egress — but the policies and telemetry are LLM-aware. At minimum, a production AI gateway gives you four things: unified API access across providers, automatic fallback and retry, observability over tokens and cost, and policy enforcement (rate limits, budgets, PII filtering, prompt injection defense).

The naming varies — LLM router, model gateway, inference proxy, LLM control plane — but the pattern is the same. Your app calls the gateway. The gateway decides which provider and model to use, enforces guardrails, records telemetry, and returns a response that looks the same regardless of what happened under the hood. Think of it as the React-Query or Stripe-adapter layer for your model calls: a boring piece of plumbing that quietly earns its keep every single day.

The four jobs a gateway actually does

Most founders hear "AI gateway" and think "thing that routes between OpenAI and Anthropic." That is one of the jobs, but it is not the most important one.

Job one: reliability. A gateway gives you retry, fallback, and circuit-breaker behavior across providers. When Anthropic's us-east endpoint returns a 529, the gateway retries against eu-west, and if that fails, falls back to OpenAI with a mapped-equivalent model. Your application never sees the failure. This is the job that pays for the gateway the first time a provider has a bad afternoon — which, based on the last eighteen months of incident reports, will be roughly monthly.
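To make the circuit-breaker part concrete, here is a minimal Python sketch of the pattern. It is not any particular gateway's implementation: the class name, the thresholds, and the injectable clock are illustrative assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops sending traffic to a provider after repeated failures (hypothetical helper)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if a request may be sent to this provider right now."""
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, let a probe request through (half-open).
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        # A success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()


# Demo with a fake clock so the example runs instantly.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=2, cooldown_s=10.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()
print(breaker.allow())  # circuit is open, so the router should skip this provider
now[0] = 10.0
print(breaker.allow())  # cooldown elapsed, so one probe request is allowed
```

A router would keep one breaker per provider and skip any provider whose `allow()` returns False, which is what keeps a struggling endpoint from absorbing every retry.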

Job two: cost control. Token usage grows silently. Without a gateway, you find out you are spending $40K a month on LLM calls when the invoice arrives. With a gateway, every request is tagged with a team, feature, user, or workflow, and you have real-time dashboards and hard caps. A gateway is also where prompt caching and semantic caching live most cleanly. Semantic caching at the gateway layer can cut inference costs by 40–70% on repetitive workloads and shave hundreds of milliseconds off cached responses. Anthropic's prompt caching gives you a 90% discount on cached-prefix tokens when they are read back, and OpenAI applies its prompt caching automatically, at no extra charge, to repeated prompt prefixes. Both work best when caching decisions are made at a central layer, not inline in every service that calls an LLM.
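A quick worked example shows what a 90% cached-prefix discount means per request. The price of $3.00 per million input tokens and the token counts are hypothetical, and the sketch ignores output tokens and any surcharge a provider applies when first writing to the cache.

```python
def input_cost_usd(
    prefix_tokens: int,
    suffix_tokens: int,
    price_per_mtok: float,
    prefix_cached: bool,
    cached_discount: float = 0.90,
) -> float:
    """Input-token cost of one request, with the discount applied to the prefix only."""
    prefix_price = price_per_mtok * (1.0 - cached_discount) if prefix_cached else price_per_mtok
    return (prefix_tokens * prefix_price + suffix_tokens * price_per_mtok) / 1_000_000


PRICE = 3.00  # hypothetical USD per million input tokens

# An 8,000-token stable system prompt followed by a 500-token user message.
cold = input_cost_usd(8_000, 500, PRICE, prefix_cached=False)
warm = input_cost_usd(8_000, 500, PRICE, prefix_cached=True)
print(f"uncached: ${cold:.4f}  cached: ${warm:.4f}  saved: {1 - warm / cold:.0%}")
```

Under these assumptions the input-token saving works out to roughly 85% rather than the headline 90%, because the uncached suffix still pays full price. The longer and more stable the prefix, the closer you get to the headline number.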

Job three: observability. Request-level traces with token counts, latency percentiles, model, prompt template, user, and cost per call — for every provider, in one place. You cannot debug a reliability problem, tune cost, or detect abuse without this. If you have tried to do it by tailing logs, you know why.
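As a rough illustration of what a request-level trace buys you, the sketch below computes latency percentiles from a list of trace records. The `Trace` fields and the nearest-rank method are assumptions made for the example; real gateways define their own schemas and aggregation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """One gateway request record (hypothetical schema)."""
    provider: str
    model: str
    feature: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float


def latency_percentile(traces: list[Trace], pct: float) -> float:
    """Nearest-rank percentile of latency_ms; pct is in (0, 100]."""
    if not traces:
        raise ValueError("no traces recorded")
    ordered = sorted(t.latency_ms for t in traces)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Ten synthetic traces with latencies of 100, 200, ..., 1000 ms.
traces = [
    Trace("provider-a", "model-x", "search", float(ms), 900, 150, 0.004)
    for ms in range(100, 1100, 100)
]
print(latency_percentile(traces, 50), latency_percentile(traces, 95))
```

Because every record carries provider, model, and feature, the same list can be filtered before computing percentiles, which is the per-feature and per-provider view that tailing logs never gives you.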

Job four: governance. Virtual keys, per-team budgets, rate limits, audit trails, PII redaction, content filters, prompt injection detection. Most startups do not need all of this on day one. Every startup with more than ten employees needs at least half of it by the time they hit their first enterprise security review.
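PII redaction is the easiest of these to picture. The sketch below masks email addresses and US-style phone numbers before a prompt leaves the gateway. The two regular expressions are deliberately simple assumptions for illustration; production redaction needs broader patterns or a dedicated detection service.

```python
import re

# Simplified patterns for illustration only; they will miss many real-world formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
US_PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")


def redact(prompt: str) -> str:
    """Replace email addresses and US-style phone numbers with placeholders."""
    prompt = EMAIL.sub("[EMAIL]", prompt)
    return US_PHONE.sub("[PHONE]", prompt)


print(redact("Mail jane.doe@example.com or call 555-123-4567."))
```

Doing this once at the gateway means every service inherits the same policy, and the audit trail shows the redacted prompt, not the raw one.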

A gateway that does all four of these well gives you back three things your direct-SDK setup was quietly costing you: headroom to negotiate, leverage to experiment, and evidence to show auditors.

The five options worth considering in April 2026

The landscape has consolidated meaningfully over the last twelve months. Here is how I think about the main options when a client asks.

LiteLLM is the open-source de facto standard. If you are a technical team that wants full control, self-hosting, and a provider catalog that covers essentially everything, this is your starting point. It is not the fastest (Kong's benchmarks show roughly 95ms of overhead versus 12ms for its own gateway), but for most SaaS apps with sub-second LLM round trips, that extra 80ms or so is noise. Best for: engineering-led startups that value control over managed convenience.

Portkey is the most feature-dense managed gateway. It claims over 1,600 model integrations, has excellent observability, and its fallback, semantic caching, and cost-routing features are the most mature in the managed category. About 27ms of overhead. Best for: teams that want production observability out of the box and are willing to pay for a hosted control plane.

Kong AI Gateway is the right answer if you are already running Kong. It extends the existing API gateway with AI-specific plugins, adds around 12ms of latency, and gives you the best governance story for enterprise deployments. Since version 3.8 it has included a semantic cache plugin that uses Redis-backed vector storage. Best for: organizations with an existing Kong footprint or strict latency SLOs.

Cloudflare AI Gateway is the zero-infrastructure option. If your app already lives on Cloudflare, turning on AI Gateway is a checkbox and you get logging, caching, and basic analytics essentially for free. Less feature depth than Portkey or LiteLLM, but a genuinely good starting point for seed-stage teams. Best for: Cloudflare-native apps that want observability without a new dependency.

OpenRouter is a different category — it is an aggregator that gives you one API across many providers and adds a 5–15% markup on top of underlying model pricing. Useful for rapid experimentation and for consumer apps that want exotic models without separate vendor contracts. Production teams consistently outgrow it because the markup compounds and the cost controls are thin. Best for: prototyping and breadth, not production scale.

I tend to recommend LiteLLM for most of my clients. It is open source, the community is large, the provider coverage is comprehensive, and you can bolt on Portkey or a dedicated observability tool later if you outgrow the built-in features. For non-technical teams or Cloudflare-native apps, I'll push toward Cloudflare AI Gateway or a managed Portkey deployment to keep the operational surface small.

The adoption path I actually use

Here is the practical sequence I walk clients through.

Step one: stop calling providers directly. Even if you are not using a gateway yet, introduce a thin client in your codebase that every LLM call goes through. This is thirty lines of TypeScript or Python. It gives you one place to add a gateway later without chasing down dozens of call sites. If you are already past this point and have openai.chat.completions.create(...) calls sprinkled across ten services, your first gateway task is consolidation, not deployment.
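Here is roughly what that thin client can look like in Python. Everything in it is a hypothetical shape, not a real SDK: `LLMClient`, `LLMRequest`, `LLMResponse`, and the `Transport` callable are names invented for the sketch, and `echo_transport` stands in for whatever provider SDK or gateway URL you actually call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class LLMRequest:
    prompt: str
    model: str
    feature: str  # tag used later for cost attribution
    user_id: str


@dataclass(frozen=True)
class LLMResponse:
    text: str
    model: str


# The transport is the only piece that knows about a provider SDK or gateway.
Transport = Callable[[LLMRequest], LLMResponse]


class LLMClient:
    """Single choke point for every model call in the codebase."""

    def __init__(self, transport: Transport, default_model: str) -> None:
        self._transport = transport
        self._default_model = default_model

    def complete(
        self, prompt: str, *, feature: str, user_id: str, model: Optional[str] = None
    ) -> LLMResponse:
        request = LLMRequest(
            prompt=prompt,
            model=model or self._default_model,
            feature=feature,
            user_id=user_id,
        )
        return self._transport(request)


def echo_transport(request: LLMRequest) -> LLMResponse:
    """Stand-in transport so the sketch runs without network access."""
    return LLMResponse(text=f"[{request.model}] {request.prompt}", model=request.model)


client = LLMClient(echo_transport, default_model="example-model")
reply = client.complete("Summarize this ticket.", feature="support-summary", user_id="u_123")
print(reply.text)
```

Swapping the direct-SDK transport for a gateway-backed one later is then a one-line change where the client is constructed, and the mandatory `feature` and `user_id` arguments mean tagging is in place before the gateway arrives.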

Step two: deploy a gateway in shadow mode. Route a small percentage of traffic — 5–10% — through the gateway and compare latency, cost, and output quality to your direct-API baseline. Do not try to switch everything at once. Gateway migrations fail when teams assume the gateway is transparent; it never is on day one.
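One simple way to pick that 5–10% slice is to hash a stable key, such as a user or request ID, into 100 buckets. This is a sketch of that idea under my own assumptions; the function name is hypothetical, and many gateways and feature-flag tools offer an equivalent built in.

```python
import hashlib


def routes_through_gateway(key: str, percent: int) -> bool:
    """Deterministically assign a key to the gateway for `percent` of traffic (0-100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in [0, 99]
    return bucket < percent


# The same key always gets the same answer, so a user never flips between paths.
sample = [f"user-{i}" for i in range(10_000)]
share = sum(routes_through_gateway(k, 10) for k in sample) / len(sample)
print(f"{share:.1%} of sample keys go through the gateway")
```

Hashing on a user ID rather than choosing randomly per request keeps each user's experience consistent, which makes latency and quality comparisons against the direct-API baseline much easier to read.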

Step three: turn on observability and cost tagging. Before you use the gateway for anything else, make sure every call is tagged with the team, feature, user, and workflow. The data you collect in the first two weeks will change what you prioritize next. Most teams discover that one feature accounts for 60% of their LLM spend and that another feature has a 30% error rate that was invisible without gateway traces.
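Once calls are tagged, the two numbers mentioned above (share of spend and error rate per feature) fall out of a few lines of aggregation. The record shape here, dictionaries with `feature`, `cost_usd`, and `status` keys, is an assumption for the sketch, as are the sample values.

```python
from collections import defaultdict


def spend_share_by_feature(records: list[dict]) -> dict[str, float]:
    """Fraction of total spend attributable to each feature tag."""
    totals: dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["feature"]] += record["cost_usd"]
    grand_total = sum(totals.values())
    return {f: (c / grand_total if grand_total else 0.0) for f, c in totals.items()}


def error_rate_by_feature(records: list[dict]) -> dict[str, float]:
    """Fraction of calls per feature that ended with an HTTP status of 400 or above."""
    calls: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for record in records:
        calls[record["feature"]] += 1
        if record["status"] >= 400:
            errors[record["feature"]] += 1
    return {f: errors[f] / calls[f] for f in calls}


# Synthetic records standing in for exported gateway logs.
records = [
    {"feature": "summarize", "cost_usd": 6.0, "status": 200},
    {"feature": "chat", "cost_usd": 3.0, "status": 200},
    {"feature": "chat", "cost_usd": 1.0, "status": 500},
]
print(spend_share_by_feature(records))
print(error_rate_by_feature(records))
```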

Step four: enable caching. Turn on prompt caching for any prompt with a stable prefix of more than 1,024 tokens. Turn on semantic caching for FAQ-style workloads: support, documentation Q&A, common internal tools. Measure cache hit rate. A well-tuned semantic cache typically reaches a 30–50% hit rate for support use cases, which translates directly to cost and latency savings.
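For intuition about how a semantic cache decides on a hit, here is a toy version. The bag-of-words `embed` function is a stand-in I am assuming purely so the example runs offline; a real deployment would use an embedding model and a vector store, and the 0.8 threshold is illustrative, not a recommendation.

```python
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts. A real cache would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8) -> None:
        self.threshold = threshold
        self._entries: list[tuple[Counter, str]] = []
        self.hits = 0
        self.misses = 0

    def put(self, prompt: str, response: str) -> None:
        self._entries.append((embed(prompt), response))

    def get(self, prompt: str) -> Optional[str]:
        """Return the stored response for the most similar prompt, if similar enough."""
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self._entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            self.hits += 1
            return best_response
        self.misses += 1
        return None

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


cache = SemanticCache()
cache.put("how do i reset my password", "Use the reset link on the login page.")
print(cache.get("how do i reset my password please"))  # near-duplicate wording
print(cache.get("what is your refund policy"))  # unrelated question
print(cache.hit_rate())
```

The threshold is the tuning knob: set it too low and users get answers to questions they did not ask, set it too high and the hit rate collapses. That is why measuring hit rate alongside answer quality matters.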

Step five: configure fallback. Pick two providers with equivalent-enough models for your use case. Configure the gateway to fall back automatically if the primary provider returns a 5xx or times out. Test it by killing the primary in staging. This is the single highest-ROI gateway feature.
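The fallback rule itself is small enough to sketch. In this hypothetical version, providers are plain callables, `ProviderHTTPError` is an exception type invented for the example, and only 5xx responses and timeouts trigger the secondary. Re-raising 4xx errors is my own assumption, on the reasoning that a malformed request will fail on the second provider too.

```python
from typing import Callable

Provider = Callable[[str], str]  # prompt in, completion text out


class ProviderHTTPError(Exception):
    """Hypothetical error carrying the provider's HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"provider returned HTTP {status}")
        self.status = status


def should_fall_back(exc: Exception) -> bool:
    """Fall back on timeouts and 5xx responses only."""
    if isinstance(exc, TimeoutError):
        return True
    return isinstance(exc, ProviderHTTPError) and 500 <= exc.status <= 599


def complete(prompt: str, primary: Provider, secondary: Provider) -> tuple[str, str]:
    """Return (provider_used, completion), trying the secondary on qualifying failures."""
    try:
        return "primary", primary(prompt)
    except Exception as exc:
        if not should_fall_back(exc):
            raise
        return "secondary", secondary(prompt)


def killed_primary(prompt: str) -> str:
    """Simulates the staging test where the primary provider is taken down."""
    raise ProviderHTTPError(503)


def healthy_secondary(prompt: str) -> str:
    return f"secondary answer to: {prompt}"


print(complete("ping", killed_primary, healthy_secondary))
```

The `killed_primary` stub mirrors the staging test described above: if the application still gets an answer while the primary is down, the fallback path works.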

Step six: add governance. Virtual keys per team, per-workflow budget caps, rate limits per user. This is where the gateway stops being a convenience and becomes a control plane.
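A budget cap per virtual key boils down to a ledger that refuses a charge once the cap would be exceeded. The sketch below is an in-memory illustration with hypothetical names (`BudgetLedger`, `BudgetExceeded`) and made-up keys and amounts; a real gateway would persist spend and reset it on a billing period.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push a virtual key past its cap."""


class BudgetLedger:
    def __init__(self, caps_usd: dict[str, float]) -> None:
        self._caps = dict(caps_usd)
        self._spent = {key: 0.0 for key in caps_usd}

    def charge(self, virtual_key: str, cost_usd: float) -> None:
        """Record spend for a key, rejecting the call if it would exceed the cap."""
        if virtual_key not in self._caps:
            raise KeyError(f"unknown virtual key: {virtual_key}")
        if self._spent[virtual_key] + cost_usd > self._caps[virtual_key]:
            raise BudgetExceeded(f"{virtual_key} would exceed its cap")
        self._spent[virtual_key] += cost_usd

    def remaining(self, virtual_key: str) -> float:
        return self._caps[virtual_key] - self._spent[virtual_key]


ledger = BudgetLedger({"team-search": 10.0, "team-support": 25.0})
ledger.charge("team-search", 6.0)
ledger.charge("team-search", 3.0)
print(ledger.remaining("team-search"))

try:
    ledger.charge("team-search", 2.0)  # would take the team past its 10.0 cap
except BudgetExceeded as exc:
    print("blocked:", exc)
```

This is the difference between a dashboard and a control plane: the runaway job is stopped at the cap instead of being discovered on the invoice.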

Most clients get to step five within two weeks of the initial deploy. Step six is usually driven by an event — a runaway job, an enterprise security review, or a headcount that made shared provider keys untenable.

The counterpoint — when you should not

Because I am a fractional CTO and not a gateway vendor, here is the honest counter-argument. If you are pre-product-market-fit, have fewer than 1,000 LLM calls a day, and are using one provider with one model, a gateway is premature. Ship the feature. Watch the cost. Revisit when cost crosses $500 per month or when you have two or more LLM-backed features in production.

If your LLM use case is a single background batch job that runs nightly, you probably do not need a gateway either. You need a retry-with-backoff wrapper and a cost dashboard. Do not add infrastructure to solve a problem you do not have yet.
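For reference, a retry-with-backoff wrapper really is this small. The sketch uses exponential backoff with full jitter; the attempt count and delays are illustrative defaults, and the `sleep` parameter is injectable only so the demo runs without waiting.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay_s: float = 0.5,
    max_delay_s: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with exponential backoff and full jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay_s, base_delay_s * 2 ** (attempt - 1))
            sleep(random.uniform(0, ceiling))
    raise RuntimeError("max_attempts must be at least 1")


# Demo: a job that fails twice, then succeeds. Delays are recorded instead of slept.
attempts = {"count": 0}


def flaky_job() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")
    return "done"


delays: list[float] = []
result = retry_with_backoff(flaky_job, sleep=delays.append)
print(result, len(delays))
```

In a real batch job you would narrow the `except` clause to the transient errors your provider SDK raises, so that a bad request fails fast instead of being retried.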

The inflection point is usually around 10,000 calls per day, multiple features, or the first conversation with a compliance officer — whichever comes first. Below that, a thin client wrapper and a spreadsheet are sufficient. Above it, you are paying the gateway tax either way; better to pay it deliberately than to reconstruct the same features in application code over six sprints.

Your action checklist

Here is what I would do this week if you are running AI features in production without a gateway.

  1. Audit your current LLM calls. List every place in your codebase that calls a model API. Include background jobs, edge functions, and embedded prompts in Retool or Zapier. You cannot route what you cannot see.

  2. Tag your top five cost drivers. Even without a gateway, instrument your existing calls to tag each request with feature name and user ID. Export the data to a dashboard. You will be surprised which feature is expensive.

  3. Pick a gateway and deploy it in staging. LiteLLM for engineering-led teams, Portkey for feature depth, Cloudflare AI Gateway for Cloudflare-native stacks. Get one call running through it end-to-end.

  4. Shadow 10% of production traffic. Compare latency, cost, and output to your direct baseline for at least a week before cutting over.

  5. Enable prompt caching and fallback. These are the two features that pay for the gateway immediately. Everything else can wait.

  6. Write an on-call runbook. Document what happens when the gateway itself has a problem. Your gateway is now a critical path dependency; treat it that way.

  7. Revisit quarterly. The gateway landscape is moving fast. Reassess every three months — cheaper providers, better caching, and new governance features will shift the calculus.
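The audit in the first checklist item can start with a crude scan of the codebase. The sketch below walks a directory and reports lines containing provider call patterns. The two default patterns and the file extensions are assumptions to adapt to your own stack, and it will not find prompts embedded in tools like Retool or Zapier.

```python
from pathlib import Path

# Illustrative patterns; extend them with whatever SDKs and endpoints you use.
DEFAULT_PATTERNS = ("chat.completions.create", "api.anthropic.com")
SOURCE_SUFFIXES = {".py", ".ts", ".tsx", ".js"}


def find_llm_call_sites(
    root: str, patterns: tuple[str, ...] = DEFAULT_PATTERNS
) -> list[tuple[str, int, str]]:
    """Return (file path, line number, line text) for every matching source line."""
    hits: list[tuple[str, int, str]] = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if any(pattern in line for pattern in patterns):
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_llm_call_sites(".")` from the repository root gives you a first list to work from. Treat it as a floor, not a ceiling, since wrapped or dynamically constructed calls will not match a plain substring search.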

Where I come in

Most of the AI reliability, cost, and governance work I do as a fractional CTO routes through a gateway conversation at some point. Whether you are picking your first one, migrating off OpenRouter to cut markup, or tightening governance before an enterprise deal, I can help you make the call and land the migration without derailing your roadmap. Book a 30-minute call and bring your current LLM invoice — we will find the waste and the reliability gaps in the first fifteen minutes.

