Claude, GPT, Gemini, or Open-Weights? A Model Selection Guide for Product Teams

The model selection conversation at a startup almost always sounds the same. Someone on the team has read a benchmark or a LinkedIn post. Someone else has a favorite. The founder is nervous about vendor lock-in. The engineer building the feature has just used the model they already had an API key for. No one has written down the criteria. The decision gets made by whoever talked last.

This article is the framework I use with clients to turn model selection from a vibes conversation into a decision you can defend. It is deliberately boring. It is not about which model is "best" — there is no answer to that question in the abstract. It is about which model is best for a specific feature, against a specific set of criteria, for your specific company.

The premise: model choice is per-feature, not per-company

The first mental shift is that "which model should we use" is the wrong question. The right question is "which model should we use for this specific feature, and how will we revisit that decision." Different features in the same product will have different right answers. A team that standardizes on one model across every feature is optimizing for convenience at the cost of everything else.

Most of my clients end up with three to five models in production across their features. The routing layer is usually a simple lookup table in application code, not anything fancy. The point is not to use every model — it is to match the model to the job.

The six criteria that matter

When evaluating a model for a specific feature, the six things that actually matter, in rough order of decision weight:

1. Quality on your task. How well does the model do the specific thing you need it to do? Not how well it does on MMLU. How well it does on your eval. If you do not have an eval, build one before you pick a model. See why your eval suite matters more than your prompt.

2. Cost per request. Input tokens, output tokens, and the frequency of the request all matter. A model that is 20% cheaper per token but takes 30% more tokens to complete the task is roughly a wash (0.8 × 1.3 = 1.04, so it ends up about 4% more expensive). Measure end-to-end cost per successful completion, not per-token list price.

3. Latency. How fast does the model return a response? For interactive features, latency under 2 seconds matters a lot. For background features, 20 seconds is fine. Frontier models are usually slower than smaller models. Streaming can hide a lot of latency for user-facing features.

4. Context window. How much can you fit into one request? Long-context models eliminate the need for retrieval in some cases and are worth paying for when the task requires it. For short-context tasks, the extra window is wasted.

5. Privacy and data handling. Does the vendor train on your data by default? Can you turn it off? Do they offer zero-retention? Can they sign a BAA if you need HIPAA? Some of this is deal-breaker material for regulated industries and irrelevant for consumer products.

6. Vendor lock-in risk. How hard would it be to switch? Proprietary features (vendor-specific tool-use formats, structured outputs, prompt caching) are conveniences now and switching costs later.
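To make criterion 2 concrete, here is a minimal sketch of "cost per successful completion" in TypeScript. The interface, the function name, and every price and token count are assumptions chosen for illustration, not real vendor pricing. It also assumes the 30% token increase applies to both input and output tokens.

```typescript
// Illustrative only: prices and token counts are made-up numbers.
interface CostProfile {
  inputPricePerMTok: number;  // USD per million input tokens
  outputPricePerMTok: number; // USD per million output tokens
  avgInputTokens: number;     // measured on your own traffic
  avgOutputTokens: number;    // measured on your own traffic
  successRate: number;        // fraction of requests that pass your eval, in (0, 1]
}

function costPerSuccess(p: CostProfile): number {
  if (p.successRate <= 0 || p.successRate > 1) {
    throw new RangeError("successRate must be in (0, 1]");
  }
  const costPerRequest =
    (p.avgInputTokens * p.inputPricePerMTok +
      p.avgOutputTokens * p.outputPricePerMTok) / 1_000_000;
  // Failed requests still cost money, so spread their cost over the successes.
  return costPerRequest / p.successRate;
}

const modelA: CostProfile = {
  inputPricePerMTok: 1.0,
  outputPricePerMTok: 5.0,
  avgInputTokens: 2000,
  avgOutputTokens: 500,
  successRate: 0.95,
};

// 20% cheaper per token, but 30% more tokens for the same task.
const modelB: CostProfile = {
  inputPricePerMTok: 0.8,
  outputPricePerMTok: 4.0,
  avgInputTokens: 2600,
  avgOutputTokens: 650,
  successRate: 0.95,
};

// The "cheaper" model comes out about 4% more expensive per success.
console.log(costPerSuccess(modelB) / costPerSuccess(modelA));
```

The success rate is the term teams most often leave out. A model with a lower list price and a lower pass rate on your eval can easily cost more per usable answer.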

The model categories as of 2026

A rough map of what is available. The specific vendors and models shift constantly — I am not going to name current-generation model numbers because they will be outdated by the time you read this — but the categories are stable.

Frontier closed models. The highest-quality, highest-cost models from the major labs. Anthropic's top Claude, OpenAI's top GPT, Google's top Gemini. Use these for hard tasks where quality matters and request volume is low enough that the premium is justified. Bad fit for high-volume, low-stakes tasks.

Mid-tier closed models. The step down from frontier — cheaper, faster, usually still excellent. This is where most of my production routing ends up. 80% of real tasks can be handled by a mid-tier model with no noticeable quality drop and 5–10x cost savings.

Small closed models. The cheapest and fastest tier from the major labs. Good for routing decisions, classification, simple extractions, and anywhere else you are calling the model as a tool rather than as a writer. Not good for long reasoning or creative work.

Open-weights hosted. Llama, Mistral, Qwen, DeepSeek, and others, hosted on Groq, Together, Fireworks, or equivalent. Competitive quality at competitive prices. The main reasons to use these over closed models are privacy, cost at very high volume, and the ability to fine-tune. The main reason not to is that the tooling ecosystem around closed models is still more mature.

Self-hosted open-weights. Run the model yourself. Use this only if you have a real privacy requirement that the hosted options do not meet, or you have enough volume that the amortized GPU cost is below the API cost, or you have specific fine-tuning needs. It is more work than founders expect.
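The volume argument for self-hosting is a break-even calculation. A rough sketch, assuming self-hosting costs are fixed per month and API costs scale linearly with requests. The field names and dollar figures are hypothetical:

```typescript
// Illustrative only: all dollar figures are made-up numbers.
interface SelfHostEstimate {
  monthlyGpuCostUsd: number;    // amortized hardware or rental for all GPUs
  monthlyOpsCostUsd: number;    // engineering time, monitoring, on-call
  apiCostPerRequestUsd: number; // what the hosted API charges per request today
}

// Requests per month above which self-hosting is cheaper than the API,
// treating the self-hosting cost as fixed regardless of volume.
function breakEvenRequestsPerMonth(e: SelfHostEstimate): number {
  if (e.apiCostPerRequestUsd <= 0) {
    throw new RangeError("apiCostPerRequestUsd must be positive");
  }
  return (e.monthlyGpuCostUsd + e.monthlyOpsCostUsd) / e.apiCostPerRequestUsd;
}

// With these numbers the break-even is roughly 2.5 million requests per month.
console.log(
  breakEvenRequestsPerMonth({
    monthlyGpuCostUsd: 6000,
    monthlyOpsCostUsd: 4000,
    apiCostPerRequestUsd: 0.004,
  }),
);
```

The ops line is where the "more work than founders expect" shows up. Leaving it at zero makes self-hosting look cheaper than it is.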

The decision framework

Walk through these questions for each feature you are picking a model for:

Q1: What is the quality floor for this task? If the task fails at anything less than frontier quality (medical reasoning, legal drafting, code review), you probably need a frontier model. If the task tolerates modest errors (classification, extraction, routing), a smaller model works.

Q2: How many times per day will this run? If the answer is "hundreds" or "thousands," cost will dominate the calculus. If the answer is "dozens," quality will dominate.

Q3: Is the user waiting on the response? If yes, latency is a hard constraint. Streaming helps. Background jobs give you flexibility.

Q4: Does the feature touch data you cannot send to a third party? If yes, you are narrowed to vendors that offer the right data-handling posture (zero-retention, BAA, or self-hosted).

Q5: Is the task easily verifiable? If you can easily check whether a response is correct, you can use cheaper models with a quality check. If verification is hard or subjective, you need higher-quality models up front.

Q6: How locked in would you be? If the feature uses vendor-specific formats or prompt caching heavily, budget engineering time to swap later. If the feature is a plain chat completion, swapping is a one-day job.

Write down the answers. Then pick the cheapest model that meets the hard constraints and hits your quality bar on the eval.
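That last step can be written down as code. This is a sketch under stated assumptions: the `Candidate` and `Requirements` shapes, the model names, and all the numbers are hypothetical, and only three of the six questions (quality floor, latency, data handling) are modeled as hard constraints.

```typescript
interface Candidate {
  name: string;              // hypothetical label, not a real model ID
  evalScore: number;         // score on your own eval, 0..1
  costPerSuccessUsd: number; // end-to-end cost per successful completion
  p95LatencyMs: number;      // measured, not taken from a vendor page
  zeroRetention: boolean;    // vendor offers a zero-retention option
}

interface Requirements {
  minEvalScore: number;        // Q1: quality floor
  maxP95LatencyMs?: number;    // Q3: set only when a user is waiting
  needsZeroRetention: boolean; // Q4: data-handling constraint
}

function pickCheapest(
  candidates: Candidate[],
  req: Requirements,
): Candidate | undefined {
  const eligible = candidates.filter(
    (c) =>
      c.evalScore >= req.minEvalScore &&
      (req.maxP95LatencyMs === undefined ||
        c.p95LatencyMs <= req.maxP95LatencyMs) &&
      (!req.needsZeroRetention || c.zeroRetention),
  );
  // filter() returned a new array, so sorting it does not mutate the input.
  // The result is undefined when nothing meets the hard constraints.
  return eligible.sort((a, b) => a.costPerSuccessUsd - b.costPerSuccessUsd)[0];
}

const candidates: Candidate[] = [
  { name: "frontier", evalScore: 0.95, costPerSuccessUsd: 0.02, p95LatencyMs: 4000, zeroRetention: true },
  { name: "mid-tier", evalScore: 0.91, costPerSuccessUsd: 0.004, p95LatencyMs: 1500, zeroRetention: true },
  { name: "small", evalScore: 0.78, costPerSuccessUsd: 0.0005, p95LatencyMs: 400, zeroRetention: false },
];

const choice = pickCheapest(candidates, {
  minEvalScore: 0.9,
  maxP95LatencyMs: 2000,
  needsZeroRetention: true,
});
console.log(choice?.name);
```

Returning `undefined` rather than a default is deliberate in this sketch: if no candidate meets the hard constraints, that is a finding to discuss, not something to paper over.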

The routing pattern

A common architecture at clients: a thin routing layer in front of every LLM call that picks the model based on the feature and the request. Something like:

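// Tier names are placeholders; map them to real model IDs in one place.
type ModelChoice =
  | "frontier-claude" | "mid-tier-claude" | "small-fast-model"
  | "open-hosted-llama" | "default-mid-tier";
type RequestInput = { text: string }; // placeholder request shape

function pickModel(feature: string, input: RequestInput): ModelChoice {
  if (feature === "summarization-high-stakes") return "frontier-claude";
  if (feature === "classification") return "small-fast-model";
  if (feature === "chat-support") return "mid-tier-claude";
  if (feature === "bulk-extraction") return "open-hosted-llama";
  return "default-mid-tier";
}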

This is not fancy. It is just a switch. But it lets you tune each feature independently, swap models without touching product code, and roll out new models to one feature at a time to measure impact.

On top of this, some teams add fallback routing: try the cheap model first, run a quality check on the response, and escalate to a better model on failure. This compounds the savings when the cheap model is good enough most of the time but unreliable on edge cases.
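A minimal sketch of that fallback pattern, assuming the model calls and the quality check are passed in as plain functions. `ModelCall`, `QualityCheck`, `withFallback`, and the stub models are all hypothetical names; real calls would wrap whatever client library you use.

```typescript
type ModelCall = (prompt: string) => Promise<string>;
type QualityCheck = (response: string) => boolean;

interface FallbackResult {
  response: string;
  escalated: boolean; // true when the stronger model produced the answer
}

async function withFallback(
  prompt: string,
  cheap: ModelCall,
  strong: ModelCall,
  passes: QualityCheck,
): Promise<FallbackResult> {
  try {
    const first = await cheap(prompt);
    if (passes(first)) return { response: first, escalated: false };
  } catch {
    // A failed cheap call is treated like a failed quality check: escalate.
  }
  return { response: await strong(prompt), escalated: true };
}

// Stubs standing in for real API calls.
const cheapStub: ModelCall = async (p) => (p.length > 20 ? "" : "short answer");
const strongStub: ModelCall = async () => "careful answer";
const nonEmpty: QualityCheck = (r) => r.trim().length > 0;

withFallback("a prompt long enough to trip the stub", cheapStub, strongStub, nonEmpty)
  .then((r) => console.log(r.escalated, r.response));
```

Log the `escalated` flag per feature. The escalation rate tells you whether the cheap model is actually good enough most of the time, which is the condition under which the savings compound.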

The mistakes I see most

Picking a model based on a benchmark nobody on the team replicated. Benchmarks are starting points, not decisions. Benchmarks are not your eval.

Standardizing on one model "for simplicity." The simplicity is fake. You are paying the frontier price for tasks that a small model would do fine, or you are getting mediocre results on the hard tasks because you picked a small model for the whole product.

Ignoring prompt caching support. If your feature has a long, stable system prompt, prompt caching can cut cost 40–70%. That alone changes which models are cost-competitive. Do not make model selection decisions without factoring in caching.

Locking in too hard on vendor-specific tool-use formats. Every vendor has their own flavor. Write your code against a thin abstraction so swapping is cheap. It will pay off within the year.

Refusing to revisit the decision. Model quality and pricing change every few months. Re-run your eval against the current options quarterly. The right model for a feature six months ago may not be the right one today.
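For the tool-use lock-in mistake above, the thin abstraction can be as small as one neutral tool definition plus one adapter per vendor. This is a sketch: `ToolSpec`, `ToolAdapter`, and the two "vendor" payload shapes are hypothetical, and the real field names should come from each vendor's documentation.

```typescript
// Vendor-neutral tool definition that product code writes against.
interface ToolSpec {
  name: string;
  description: string;
  parameters: Record<string, unknown>; // JSON Schema for the arguments
}

interface ToolAdapter {
  toVendorFormat(tool: ToolSpec): unknown;
}

// Hypothetical payload shapes; check each vendor's docs for the real ones.
const vendorAAdapter: ToolAdapter = {
  toVendorFormat: (t) => ({
    type: "function",
    function: { name: t.name, description: t.description, parameters: t.parameters },
  }),
};

const vendorBAdapter: ToolAdapter = {
  toVendorFormat: (t) => ({
    name: t.name,
    description: t.description,
    input_schema: t.parameters,
  }),
};

const adapters: Record<string, ToolAdapter> = {
  "vendor-a": vendorAAdapter,
  "vendor-b": vendorBAdapter,
};

function toolsFor(vendor: string, tools: ToolSpec[]): unknown[] {
  const adapter = adapters[vendor];
  if (!adapter) throw new Error(`no tool adapter for vendor: ${vendor}`);
  return tools.map((t) => adapter.toVendorFormat(t));
}

const lookupOrder: ToolSpec = {
  name: "lookup_order",
  description: "Fetch an order by ID",
  parameters: { type: "object", properties: { orderId: { type: "string" } } },
};

console.log(JSON.stringify(toolsFor("vendor-b", [lookupOrder])));
```

Swapping vendors then means writing one new adapter and changing one routing entry, not editing every feature that declares a tool.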

Counterpoint: do not over-optimize

A warning. You can spend infinite time on model selection. For a feature that runs 20 times a day, it does not matter whether you pick the 3rd cheapest or the 5th cheapest. Pick one that passes the eval, ship the feature, and come back to optimize when the feature matters enough to be worth the effort. Over-engineering model selection is a great way to delay shipping.

Your next step

Pick one feature in your product that uses an LLM. Write down answers to the six framework questions (Q1 through Q6) above. Run your eval (if you have one) against two other model choices. If the current choice is the best, confirm it. If not, plan the swap. Repeat for the next feature next week.

Where I come in

Model selection reviews are part of most AI engagements I take on. Two days of focused work, a clear eval run against the realistic candidates, and a specific recommendation per feature. Book a call if you want a second opinion on your current model choices.