AI / LLM Strategy

Your AI Feature Is Probably a RAG Pipeline — Here's How to Do It Right

A practical guide to building retrieval-augmented generation features that actually work — chunking, embeddings, retrieval, re-ranking, evals, and the mistakes I see in almost every first attempt.

Craig Hoffmeyer · 8 min read

If you are building an AI feature at a SaaS startup in 2026, there is a roughly 70% chance the feature is — or should be — a retrieval-augmented generation (RAG) pipeline. "Ask the model a question about our internal documents" is RAG. "Summarize what this customer has said in the last six months" is RAG. "Draft a reply based on our past responses" is RAG. Most AI features that touch proprietary data end up in the same architectural shape, regardless of what the founder called it at the beginning.

The problem is that the first attempt at RAG is almost always bad, and the team usually does not know why. This article is the version of the RAG explainer I give clients when we sit down to plan one, or when I am reviewing an existing implementation that is not working as well as they hoped.

What RAG actually is

RAG is a pattern, not a product. The pattern is: when a user asks a question, you retrieve the relevant chunks of your own data, stuff them into the model's context along with the question, and let the model answer using that context. The model itself does not know your data — it gets the data handed to it at inference time, fresh, for every request.

That is the whole idea. Everything else is implementation detail. The implementation detail is where RAG pipelines succeed or fail.
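The pattern fits in a few lines. A minimal sketch, with retrieve and call_llm as stand-ins for a real vector store and model client (both are placeholder names, and the toy retriever ranks by word overlap rather than embeddings):

```python
# Minimal sketch of the RAG pattern. `retrieve` and `call_llm` are
# stand-ins for your vector store and model API, not a real library.

def retrieve(query: str, corpus: dict[str, str], k: int = 3) -> list[str]:
    """Toy retriever: rank chunks by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(
        corpus.items(),
        key=lambda kv: len(q_words & set(kv[1].lower().split())),
        reverse=True,
    )
    return [text for _, text in scored[:k]]

def call_llm(prompt: str) -> str:
    """Stand-in for a model API call."""
    return f"<answer grounded in: {prompt[:60]}>"

def rag_answer(query: str, corpus: dict[str, str]) -> str:
    chunks = retrieve(query, corpus)
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)
```

Everything in the rest of this article is about replacing those stubs with stages that actually work.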

The five stages of a RAG pipeline

A working RAG pipeline has five distinct stages, and most first attempts get at least two of them wrong. The stages are: ingestion, retrieval, ranking, generation, and evaluation. Each is its own problem.

Stage 1: Ingestion

Ingestion is how your data gets from its source into a form the retrieval stage can query. It has three sub-steps — loading, chunking, and embedding — and chunking is where most first attempts go off the rails.

Loading is boring in concept and painful in practice. You need to read documents, clean them, extract text, and preserve enough metadata (source, date, author, section) that you can attribute and filter later. HTML, PDFs, Word docs, Google Docs, Slack threads, and database rows all have different loaders. Each loader has its own edge cases. Budget real engineering time for this.

Chunking is how you split a long document into pieces small enough to fit in context. The naive approach — "split every 500 tokens" — works badly because it cuts across sentences, paragraphs, and semantic boundaries. A section that was about one topic gets split across two chunks, and neither chunk contains enough context to be useful.

The better approach is structure-aware chunking. Respect the document's existing structure (headings, sections, paragraphs). Aim for chunks of 200–800 tokens with ~15% overlap between adjacent chunks. Keep chunks together when they belong to the same logical section. Include a small header at the top of each chunk ("Document: X, Section: Y") so the chunk is self-describing when it is retrieved later.
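A sketch of what that looks like, assuming markdown-style "#" headings mark section boundaries and approximating tokens with word counts (a real pipeline would use the model's tokenizer):

```python
# Structure-aware chunking sketch: split on headings, never cross a
# section boundary, overlap adjacent chunks by ~15%, and prepend a
# self-describing header to every chunk.

def chunk_document(doc_name: str, text: str, max_words: int = 200) -> list[str]:
    sections, heading, buf = [], "Introduction", []
    for line in text.splitlines():
        if line.startswith("#"):              # section boundary
            if buf:
                sections.append((heading, " ".join(buf)))
                buf = []
            heading = line.lstrip("# ").strip()
        elif line.strip():
            buf.append(line.strip())
    if buf:
        sections.append((heading, " ".join(buf)))

    chunks = []
    for heading, body in sections:
        words = body.split()
        step = max(1, int(max_words * 0.85))  # ~15% overlap
        for i in range(0, len(words), step):
            piece = " ".join(words[i:i + max_words])
            chunks.append(f"[Document: {doc_name} | Section: {heading}]\n{piece}")
            if i + max_words >= len(words):
                break
    return chunks
```

The header line is the part teams most often skip, and it is the cheapest quality win in the whole ingestion stage.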

Embedding turns each chunk into a vector you can search over. Pick an embedding model and stick with it — mixing embedding models across a corpus breaks retrieval. The defaults from OpenAI, Voyage, or Cohere are all fine for most use cases. Benchmark them only if you have evidence they are underperforming on your domain.

Stage 2: Retrieval

Retrieval is the step where a user's query gets turned into a set of candidate chunks. The naive version is "embed the query and find the top K nearest chunks by cosine similarity." The naive version is also wrong in most real products.

The first thing to add is hybrid retrieval. Pure vector search misses exact-match queries (a user searching for an invoice number or an error code). Pure keyword search misses semantic queries. Run both in parallel, merge the results, and let the ranking stage sort it out. Postgres with pgvector plus full-text search gives you this in one database, which is usually the right call at startup scale.
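A common way to merge the two result lists is reciprocal rank fusion (RRF): score each chunk by where it ranks in each list, then sort by the fused score. A sketch, taking ranked lists of chunk ids as input:

```python
# Reciprocal rank fusion: a standard way to merge keyword and vector
# result lists without having to normalize their incompatible scores.

def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Each chunk scores sum(1 / (k + rank)) over the lists it appears
    in; chunks ranked high in both lists float to the top."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```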

The second thing to add is query transformation. Users do not type queries in the shape that matches your corpus. A user asks "how do I fix this?" The query you actually want to run is "how do I fix [specific error message] on [specific product]?" Use a small, cheap model to rewrite the user's query into a retrieval-friendly form before you search.
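The rewrite itself is just a prompt to a cheap model. A sketch, with call_small_model as a stand-in for whatever model client you use; the prompt wording is illustrative, not canonical:

```python
# Query rewriting before retrieval. The prompt is the interesting part;
# `call_small_model` is a placeholder for a cheap model API call.

REWRITE_PROMPT = """Rewrite the user's question into a standalone search query.
Expand vague references using the conversation context.
Return only the rewritten query.

Context: {context}
Question: {question}
Rewritten query:"""

def rewrite_query(question: str, context: str, call_small_model) -> str:
    prompt = REWRITE_PROMPT.format(context=context, question=question)
    return call_small_model(prompt).strip()
```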

The third thing to add is metadata filtering. If the user is in the "Acme" workspace, the retrieval should only consider Acme's documents. This sounds obvious and is the source of a surprising number of production bugs where one customer's RAG feature accidentally retrieves another customer's data. Test it explicitly.
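The filter belongs in the retrieval query itself, not applied afterwards: post-filtering silently returns fewer than k results, and a missed code path leaks data. A sketch with pgvector (table and column names here are assumptions; `<=>` is pgvector's cosine-distance operator):

```python
# Tenant filtering inside the retrieval query. Refusing to run without
# a tenant id turns a cross-tenant leak into a loud error instead.

QUERY = """
    SELECT chunk_id, content
    FROM chunks
    WHERE tenant_id = %(tenant_id)s       -- never retrieve across tenants
    ORDER BY embedding <=> %(query_vec)s  -- pgvector cosine distance
    LIMIT %(k)s
"""

def build_retrieval_params(tenant_id: str, query_vec: list[float], k: int = 10) -> dict:
    if not tenant_id:
        raise ValueError("refusing to run retrieval without a tenant filter")
    return {"tenant_id": tenant_id, "query_vec": query_vec, "k": k}
```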

Stage 3: Ranking

Retrieval gives you 50 candidate chunks. Your context budget only has room for 10 of them. Ranking is how you pick the 10. This step is disproportionately important and disproportionately skipped.

The cheap version of ranking is "use the similarity scores from retrieval." This is fine for a first cut. The better version uses a dedicated re-ranker — a small model specifically trained to score how relevant a chunk is to a query. Cohere, Voyage, and several open-weight options will give you a meaningful quality lift here for minimal cost.

The third layer is diversity. The top 10 chunks by relevance are often variations of the same thing. A good ranking stage deduplicates and picks chunks that together cover the breadth of the query. This is the difference between a model that answers "here are five different angles" and one that answers "here is the same fact five times."
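One common way to implement that trade-off is maximal marginal relevance (MMR): each pick balances relevance against similarity to what has already been selected. A sketch, assuming you already have per-chunk relevance scores and a chunk-to-chunk similarity function:

```python
# Maximal marginal relevance: pick chunks one at a time, penalizing
# each candidate by its similarity to chunks already selected.

def mmr_select(relevance: dict[str, float], sim, n: int,
               lam: float = 0.7) -> list[str]:
    selected: list[str] = []
    candidates = set(relevance)
    while candidates and len(selected) < n:
        def score(c):
            redundancy = max((sim(c, s) for s in selected), default=0.0)
            return lam * relevance[c] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With lam near 1 you get pure relevance ranking; lowering it trades relevance for coverage.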

Stage 4: Generation

Generation is where the model finally sees the retrieved chunks and writes an answer. It is the stage everyone focuses on, and the one that matters least for quality if the previous stages are right.

The prompt at this stage has three jobs: tell the model how to use the context, tell it what to do when the context does not answer the question, and constrain the format of the output. A rough template:

You are answering a user's question using only the provided context.
If the context does not contain enough information to answer, say so.
Cite the source of each claim using the [source: X] format.
Format your answer as: [format].

Context:
{chunks}

Question: {query}

This is not sophisticated prompt engineering. It is basic scaffolding. The model does most of the work on its own if the chunks are good. Fancy prompt tricks are the last 5% — fix the pipeline first.
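In code, assembling that template mostly means formatting each chunk with its source tag so the citations in the instructions have something to point at. A sketch using the template above:

```python
# Assembling the generation prompt. Each chunk carries its source so
# the model can emit [source: X] citations.

PROMPT_TEMPLATE = """You are answering a user's question using only the provided context.
If the context does not contain enough information to answer, say so.
Cite the source of each claim using the [source: X] format.

Context:
{context}

Question: {query}"""

def build_prompt(query: str, chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[source: {c['source']}]\n{c['text']}" for c in chunks
    )
    return PROMPT_TEMPLATE.format(context=context, query=query)
```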

Stage 5: Evaluation

Evaluation is the stage everyone skips and then regrets. Without an eval, you cannot tell whether a change to chunking made the system better or worse. Without an eval, you cannot tell whether a new model version is actually an improvement. Without an eval, you are shipping on vibes.

The minimum viable eval for a RAG pipeline is 50 real questions with expected answers (or expected chunks the system should retrieve). Grade each response on three dimensions: was the right chunk retrieved, did the model use it correctly, and is the output in the right format. You can grade by hand or use an LLM-as-judge; both are fine at this scale.

Run the eval before every significant change. Track the score over time. The eval is the single piece of infrastructure that makes everything else in the pipeline tractable.
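The retrieval half of that eval is a few lines of code once you have the question set. A sketch of the simplest useful metric, hit rate on the expected chunk, with retrieve_fn standing in for your pipeline's retrieval stage:

```python
# Minimum viable retrieval eval: for each question, did the expected
# chunk land in the top-k results? `retrieve_fn` is your retrieval stage.

def retrieval_hit_rate(eval_set: list[dict], retrieve_fn, k: int = 10) -> float:
    """eval_set entries look like {"question": ..., "expected_chunk": ...}."""
    hits = 0
    for case in eval_set:
        top_k = retrieve_fn(case["question"])[:k]
        if case["expected_chunk"] in top_k:
            hits += 1
    return hits / len(eval_set)
```

Grading whether the model used the chunk correctly still needs a human or an LLM-as-judge, but this number alone catches most chunking and retrieval regressions.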

The six most common mistakes

From reviewing a lot of RAG pipelines, here are the mistakes I see over and over:

  1. Naive fixed-length chunking that ignores document structure. Fix: chunk on structure, not on tokens.
  2. Pure vector search with no keyword fallback. Fix: hybrid retrieval.
  3. No re-ranking stage. Fix: add a re-ranker, immediately.
  4. No metadata filtering, so users sometimes see other tenants' data. Fix: filter by tenant and test the filter.
  5. No eval, so every change is guesswork. Fix: build the 50-question eval before the third iteration.
  6. Spending all the effort on prompt engineering instead of retrieval quality. Fix: assume the prompt is 5% of the problem.

The cost reality

RAG pipelines are cheaper than fine-tuning and more expensive than simple API calls. A typical cost profile for a RAG query at modest scale: embedding ($0.0001), retrieval (nearly free in Postgres), re-ranking ($0.001), generation ($0.005–0.05 depending on model and context length). Per query, that is usually fractions of a cent. At scale, it adds up — see the hidden cost curve of LLM features for the usual traps.
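The arithmetic, using the figures above with generation taken at the low end, and a sample volume of one million queries per month (the volume is my illustration, not a claim from the profile):

```python
# Rough per-query cost from the stage figures above, generation at the
# low end of its range; monthly total at an assumed 1M queries/month.

COSTS = {
    "embedding": 0.0001,
    "retrieval": 0.0,      # Postgres, effectively free at this scale
    "reranking": 0.001,
    "generation": 0.005,   # low end of the $0.005-0.05 range
}

per_query = sum(COSTS.values())      # about $0.0061, well under a cent
monthly = per_query * 1_000_000      # about $6,100 per month at 1M queries
```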

When RAG is the wrong answer

Not every AI feature is RAG. Three cases where it is not:

Long-context models. If your data is small enough to fit in a model's context window, skip the retrieval stage entirely. Just pass the whole corpus every time. For small corpora, this is cheaper, simpler, and higher-quality than a full RAG pipeline.

Highly structured data. If the user's question is answerable from a database query, use the database. Do not embed SQL rows as vectors and do semantic search over them. That is architecturally absurd and founders still do it because RAG is in vogue.

Fine-grained personalization. If the answers depend on user-specific state that changes constantly, RAG's staleness becomes a problem. You may be better off with direct API calls into your application state.

Counterpoint: start simpler than you think you need to

A warning. It is tempting to build all five stages on day one, with hybrid retrieval and a re-ranker and an LLM-as-judge eval suite. For the first week of an engagement, I will almost always tell the team to start with the simplest possible version — fixed chunking, vector search, no re-ranker, hand-graded eval — and measure how bad it actually is. Usually "bad" is better than expected, and the list of fixes becomes obvious after you have a baseline. Starting simple and iterating beats starting sophisticated and not knowing what to change.

Your next step

If you are building a RAG feature, answer three questions this week: (1) Do we have a 50-question eval? (2) Do we have hybrid retrieval? (3) Do we have a re-ranker? If any of those is no, that is the next thing to build. It will have more impact on quality than any prompt change.

Where I come in

Reviewing or designing RAG pipelines is one of the most common AI-side engagements I take on. Usually 1–2 weeks of focused work, a clear eval, and a set of prioritized changes with measured impact. Book a call if you are building something RAG-shaped and want a second opinion before you ship it to paying customers.


Related reading: Build vs. Buy vs. Wrap: A Founder's Framework for AI Features · The Hidden Cost Curve of LLM Features · Case Study: PoC to Production in 6 Weeks
