software architecture

RAG in Production: Chunking, Re-ranking, and Hybrid Search (The Deep Dive)

Naive RAG gets you a 70%-quality demo and a plateau. The gap to production is three retrieval levers most teams never pull: chunking on structure (not character counts), hybrid search (vector + keyword), and re-ranking an over-fetched candidate set. This post is the deep dive on each, plus citations and the metrics that tell you where retrieval is failing. The thesis: retrieval quality beats model quality.

Ruchit Suthar


15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

11 min read
Key Takeaway

Naive RAG — embed the docs, do a similarity search, stuff the top-3 into the prompt — gets you a 70%-quality demo and a plateau you can't climb past. The gap between that demo and a production RAG system is almost entirely in the retrieval pipeline, and it's three levers most teams never pull: chunking on structure (not character counts), hybrid search (vector + keyword), and re-ranking an over-fetched candidate set. This is the deep dive on each, plus citations and the metrics that tell you where retrieval is failing. The headline you'll hear me repeat: retrieval quality beats model quality. Fix retrieval and a mediocre model gives great answers; ignore it and a frontier model gives vague ones.

"We're using the best model and the answers are still vague and sometimes wrong." I hear this constantly, and the team always assumes the fix is a better model or more prompt engineering. It almost never is. When I trace a disappointing RAG system, the failure is in retrieval — the model was handed mediocre context and did the best it could with it.

This is the most important and least intuitive thing about building with LLMs: a mediocre model with excellent retrieved context reliably beats a frontier model with poor context. The model can only reason over what you put in front of it. Garbage context, garbage answer — confidently phrased. So the leverage in a RAG system isn't the model. It's the pipeline that decides what context the model sees.

I covered the architecture-level view in "Designing LLM-Powered Features: RAG, Vector Databases, and the New System-Design Checklist." This is the deep dive on the three retrieval levers that actually move quality, in order of how often they're the problem.

The pipeline, and where it leaks

Production retrieval is a pipeline, and each stage is a place to win or lose quality.

Naive RAG skips query rewriting, does pure vector search, fetches a fixed top-3, and skips re-ranking. Each omission silently caps your quality. Let's fix them.

Lever 1: Chunking — the highest-leverage, most-neglected decision

How you split documents determines what can possibly be retrieved. It's an upstream decision that bounds everything downstream, and most teams give it five minutes.

The naive approach and why it fails: splitting on a fixed character count (say, every 1,000 characters). This cuts sentences in half, separates a heading from the content it introduces, and splits a single coherent idea across two chunks so neither is retrievable as a complete thought.

What works:

  • Chunk on structure, not size. Split on natural boundaries — headings, sections, paragraphs, list items. A chunk should be a self-contained idea, not an arbitrary slice.
  • Add overlap. Let adjacent chunks share a little content (a sentence or two) so an idea spanning a boundary survives in at least one chunk.
  • Right-size to the content. Too small loses context; too large dilutes relevance and wastes tokens. Match chunk size to how the information is actually structured — a tight FAQ chunks differently from a long design doc.
  • Carry metadata. Attach source, section, title, and IDs to each chunk. You'll need them for filtering, citations, and debugging.

When I turn around a "vague answers" RAG system, the fix is chunking more often than anything else. It's unglamorous and it's the highest ROI.
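
Here is a minimal sketch of those four rules in Python, not a production splitter. It assumes Markdown-style "#" headings and blank-line-separated paragraphs; the names Chunk, chunk_by_structure, max_chars, and overlap_sentences are hypothetical, and the sentence splitter is deliberately naive.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str  # stable ID, needed later for citations and debugging
    source: str
    section: str
    text: str


def last_sentences(text: str, n: int) -> str:
    """Return the final n sentences of text (naive split on ., !, ?)."""
    if n <= 0:
        return ""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[-n:])


def chunk_by_structure(doc: str, source: str, max_chars: int = 800,
                       overlap_sentences: int = 1) -> list[Chunk]:
    chunks: list[Chunk] = []
    section = "(no heading)"
    buffer: list[str] = []  # paragraphs collected for the chunk being built
    has_new = False         # True once buffer holds more than carried-over overlap

    def flush(carry_overlap: bool) -> None:
        nonlocal buffer, has_new
        if has_new:
            body = "\n\n".join(buffer)
            # Prefix the heading so it is never separated from its content.
            chunks.append(Chunk(f"{source}#{len(chunks)}", source, section,
                                f"{section}\n\n{body}"))
            tail = last_sentences(buffer[-1], overlap_sentences) if carry_overlap else ""
            buffer = [tail] if tail else []
        else:
            buffer = []  # drop overlap that was never followed by new content
        has_new = False

    for block in re.split(r"\n\s*\n", doc.strip()):
        first, _, rest = block.strip().partition("\n")
        heading = re.match(r"#+\s+(.*)", first)
        if heading:
            flush(carry_overlap=False)  # overlap does not cross section boundaries
            section = heading.group(1).strip()
            block = rest
        block = block.strip()
        if not block:
            continue
        if has_new and sum(len(p) for p in buffer) + len(block) > max_chars:
            flush(carry_overlap=True)   # size limit hit inside a section
        buffer.append(block)
        has_new = True
    flush(carry_overlap=False)
    return chunks
```

Paragraphs are never cut mid-sentence: the size limit only decides where one chunk ends and the next begins, and the last sentence of the previous chunk is repeated at the start of the next one within the same section.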

Lever 2: Hybrid search — because users paste exact strings

Pure vector (semantic) search is the RAG default and it has a blind spot: exact matches. Embeddings capture meaning, so they're great for "how do I reset my password" matching a doc titled "credential recovery." But they're worse than old-fashioned keyword search when the user pastes an error code, a product SKU, a function name, a proper noun, or an acronym — the exact tokens that should match perfectly.

Real users do this constantly. They paste ERR_CONN_4012. They search for getUserById. Pure semantic search struggles here because the embeddings of rare identifiers like these carry little meaning, so a near-miss string can score as well as the exact match.

Hybrid search runs both a vector search and a keyword/BM25 search and merges the results (commonly with Reciprocal Rank Fusion). You get semantic understanding and exact-match precision. For most real corpora — anything with codes, names, or technical terms — hybrid meaningfully outperforms pure vector, and it's a well-supported feature in modern vector databases.

The rule: if your users ever paste exact strings — and they do — you want hybrid search.
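
To make the merge step concrete, here is a small sketch of Reciprocal Rank Fusion in Python. It assumes each search returns an ordered list of document IDs; the function name, the example IDs, and the constant k=60 (a commonly used default, treat it as tunable) are illustrative assumptions, not any particular database's API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=None):
    """Fuse ranked lists of doc IDs: score(d) = sum of 1 / (k + rank) over lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n] if top_n else fused


# Hypothetical results for a query that pastes an error code.
vector_hits = ["doc_password_reset", "doc_login_help", "doc_err_4012"]
keyword_hits = ["doc_err_4012", "doc_changelog"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused[0])  # doc_err_4012
```

The exact-match document ranked third semantically and first by keyword, and it wins after fusion because it appears in both lists. RRF only looks at ranks, so you never have to reconcile cosine similarities with BM25 scores.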

Lever 3: Re-ranking — over-fetch, then be picky

Here's the move that often improves answer quality more than upgrading the LLM, and it's cheap. The first retrieval pass (vector or hybrid) optimizes for recall — casting a wide net to make sure the right chunk is somewhere in the results. But the first-pass ranking is rough; the best chunk might be #8, and if you only pass the top-3 to the model, you missed it.

Re-ranking fixes this in two steps:

  1. Over-fetch. Retrieve more candidates than you need — say top-20 instead of top-3.
  2. Re-rank with a cross-encoder. Run a re-ranking model that scores each candidate's relevance to the query directly (a cross-encoder reads the query and chunk together, far more accurate than comparing embeddings) and keep the best 3-5.

The intuition: cheap, broad retrieval to find candidates, then expensive, accurate re-ranking to order them. You feed the model fewer, better chunks — which improves answers and reduces token cost. Adding a re-ranker is one of the highest-return changes you can make to a RAG system.
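
The over-fetch-then-re-rank shape can be sketched in a few lines of Python. This is an illustration under assumptions: rerank and toy_overlap_score are hypothetical names, and the word-overlap scorer is only a stand-in so the sketch runs without a model. In a real system score_fn would wrap a cross-encoder (for example, the CrossEncoder class in the sentence-transformers library).

```python
def rerank(query, candidates, score_fn, keep=3):
    """Score every (query, candidate) pair directly and keep the best few."""
    scored = [(score_fn(query, c), c) for c in candidates]
    # Stable sort: candidates with equal scores keep their first-pass order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:keep]]


def toy_overlap_score(query, chunk):
    """Stand-in for a cross-encoder: fraction of query words found in the chunk."""
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0


# Pretend this is an over-fetched first-pass result, in first-pass order.
candidates = [
    "How to rotate database credentials",
    "Billing and invoices overview",
    "API rate limits explained",
    "Reset your API key from the dashboard",
    "Password reset for end users",
]

top = rerank("reset api key", candidates, toy_overlap_score, keep=2)
print(top[0])  # Reset your API key from the dashboard
```

The best chunk sat at position four in the first pass, so a fixed top-3 would have dropped it. Over-fetching keeps it in play, and the second pass moves it to the front.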

Don't forget citations

Every claim the model makes should trace back to a retrieved chunk. Carry the source IDs through the whole pipeline so the final answer can link to where each fact came from. Citations are three things at once: your strongest defense against hallucination (the model is anchored to real sources), your debugging tool (when an answer is wrong, you see exactly which chunk misled it), and a trust signal for users. A RAG answer with no citations is an assertion; with citations it's evidence.
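
One way to carry source IDs end to end, sketched in Python. The prompt wording, the [n] citation convention, and the function names are assumptions for illustration; the point is only that every label in the answer maps back to a retrieved chunk ID, and labels that map to nothing get flagged.

```python
import re


def build_cited_prompt(question, chunks):
    """chunks: list of dicts with 'id' and 'text'. Returns (prompt, label -> chunk ID)."""
    label_to_id = {}
    lines = []
    for n, chunk in enumerate(chunks, start=1):
        label_to_id[n] = chunk["id"]
        lines.append(f"[{n}] {chunk['text']}")
    prompt = (
        "Answer using only the sources below. Cite each claim like [1].\n\n"
        + "\n".join(lines)
        + f"\n\nQuestion: {question}"
    )
    return prompt, label_to_id


def resolve_citations(answer, label_to_id):
    """Map [n] markers in the answer back to chunk IDs; collect unknown labels."""
    cited, unknown = [], []
    for label in map(int, re.findall(r"\[(\d+)\]", answer)):
        target = cited if label in label_to_id else unknown
        value = label_to_id.get(label, label)
        if value not in target:
            target.append(value)
    return cited, unknown


chunks = [
    {"id": "kb/auth.md#2", "text": "API keys can be reset from the dashboard."},
    {"id": "kb/limits.md#0", "text": "The default rate limit is 100 requests per minute."},
]
prompt, label_to_id = build_cited_prompt("How do I reset my API key?", chunks)

# A made-up model answer: one valid citation, one pointing at no retrieved source.
answer = "Reset the key from the dashboard [1]. Keys expire yearly [3]."
cited, unknown = resolve_citations(answer, label_to_id)
print(cited, unknown)  # ['kb/auth.md#2'] [3]
```

A non-empty unknown list is a cheap signal that the model cited something it was never given, which is worth logging before the answer reaches a user.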

Measure retrieval separately from generation

The most useful diagnostic discipline in RAG: when an answer is bad, determine which stage failed. Two very different failures hide behind "the answer was wrong":

  • Retrieval failure — the right chunk wasn't in the retrieved set. No model can fix this; the fix is chunking, hybrid search, or re-ranking.
  • Generation failure — the right chunk was retrieved but the model ignored it, misread it, or hallucinated past it. The fix is the prompt or the model.

Measure them separately (this is exactly what an eval set is for): track retrieval metrics (was the right context in the top-K?) independently from answer faithfulness. Otherwise you'll tune the prompt for weeks when the problem was that the document never got retrieved.
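
A minimal sketch of the retrieval half of that measurement, in Python. It assumes an eval set where each example lists the chunk IDs that should be retrieved, and a retrieve function that returns ranked chunk IDs; all names and the fake retriever are hypothetical. Hit rate answers "was the right context in the top-K?", and mean reciprocal rank shows how high it landed.

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """True if any relevant chunk appears in the top k results."""
    return any(cid in relevant_ids for cid in retrieved_ids[:k])


def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate_retrieval(eval_set, retrieve, k=5):
    """Average hit rate at k and mean reciprocal rank over the eval set."""
    hits = 0.0
    rr = 0.0
    for example in eval_set:
        retrieved = retrieve(example["query"])
        hits += hit_at_k(retrieved, example["relevant_ids"], k)
        rr += reciprocal_rank(retrieved, example["relevant_ids"])
    n = len(eval_set) or 1  # avoid dividing by zero on an empty set
    return {"hit_rate_at_k": hits / n, "mrr": rr / n}


# Tiny made-up eval set and a fake retriever backed by a lookup table.
eval_set = [
    {"query": "q1", "relevant_ids": {"a"}},
    {"query": "q2", "relevant_ids": {"b"}},
]
fake_results = {"q1": ["x", "a", "y"], "q2": ["x", "y", "z"]}

metrics = evaluate_retrieval(eval_set, fake_results.get, k=3)
print(metrics)  # {'hit_rate_at_k': 0.5, 'mrr': 0.25}
```

Any example that misses here is a retrieval failure by definition, so it goes to chunking, hybrid search, or re-ranking. Only the examples that hit are worth checking for faithfulness on the generation side.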

What to do Monday morning

  1. Audit your chunking first. Are you splitting on character count or on structure? If it's character count, switch to structure-based chunks with overlap — it's usually the biggest single win.

  2. Turn on hybrid search. If you're doing pure vector search and your domain has any codes, names, or technical terms, add keyword/BM25 and fuse the results. Test with queries that paste exact strings.

  3. Add a re-ranker. Over-fetch top-20, re-rank with a cross-encoder, pass the best 3-5. Cheap, high-return, often beats a model upgrade.

  4. Instrument retrieval vs generation. Build the ability to answer "was the right chunk retrieved?" separately from "did the model use it well?" so you fix the right stage.

Key takeaways

  • Retrieval quality beats model quality. A mediocre model with excellent context beats a frontier model with poor context. The leverage in RAG is the pipeline that decides what the model sees, not the model.

  • Chunking is the highest-leverage, most-neglected lever. Split on document structure with overlap so each chunk is a coherent, retrievable idea — never on fixed character counts that bisect ideas. It's the most common fix for "vague answers."

  • Hybrid search handles the exact strings users actually paste. Pure vector search misses error codes, SKUs, function names, and acronyms; combining it with keyword/BM25 gives both meaning and precision.

  • Re-rank an over-fetched set. Retrieve broadly for recall (top-20), then use a cross-encoder to keep the best few for precision. It often beats upgrading the LLM and reduces token cost.

  • Cite sources and measure retrieval separately from generation. Citations defend against hallucination and aid debugging; separating retrieval failures from generation failures tells you which stage to actually fix.

Your next step

Take five recent RAG answers that were disappointing and, for each, ask one question: was the right source chunk even retrieved? If it wasn't in most cases, your problem is retrieval — chunking, hybrid search, or re-ranking — and no amount of prompt tweaking or model upgrading will fix it. That single diagnostic redirects most teams from polishing the model to fixing the pipeline, which is where the quality actually lives.

Frequently asked questions

Why are my RAG answers vague or wrong even with the best model?

Almost always because of retrieval, not the model. RAG answers are only as good as the context retrieved and placed in the prompt — a mediocre model with excellent context beats a frontier model with poor context. The usual culprits are chunking that splits ideas across boundaries, pure vector search that misses exact terms, and the absence of re-ranking so the best chunk never reaches the model. Diagnose by checking whether the correct source chunk was even retrieved; if it wasn't, prompt tweaking and model upgrades won't help.

How should I chunk documents for RAG?

Chunk on document structure — headings, sections, paragraphs, list items — so each chunk is a self-contained, coherent idea, rather than splitting on a fixed character count that bisects sentences and separates headings from their content. Add a small overlap between adjacent chunks so an idea spanning a boundary survives in at least one, right-size chunks to how the content is actually structured (a tight FAQ differs from a long design doc), and attach metadata (source, section, IDs) for filtering, citations, and debugging. Chunking is frequently the single highest-impact fix for poor RAG quality.

What is hybrid search and why does it matter for RAG?

Hybrid search combines vector (semantic) search with keyword/BM25 search and merges the rankings, typically via a fusion method like Reciprocal Rank Fusion. It matters because pure vector search captures meaning but misses exact matches — error codes, product SKUs, function names, acronyms, proper nouns — which real users paste constantly. Hybrid search gives you semantic understanding plus exact-term precision, and for most real-world corpora containing technical terms it meaningfully outperforms pure vector search.

What is re-ranking in a RAG pipeline?

Re-ranking is a two-step retrieval refinement: first over-fetch a broad candidate set optimized for recall (for example top-20), then run a cross-encoder re-ranking model that scores each candidate's relevance to the query directly and keep only the best few (3-5). A cross-encoder reads the query and chunk together, making it far more accurate at ordering than the initial embedding-similarity pass. Re-ranking feeds the model fewer, better chunks — which often improves answer quality more than upgrading the LLM while also reducing token cost.

How do I tell if a RAG problem is retrieval or generation?

Measure the two stages separately. A retrieval failure means the correct chunk wasn't in the retrieved set — no model can compensate, and the fix is chunking, hybrid search, or re-ranking. A generation failure means the right chunk was retrieved but the model ignored, misread, or hallucinated past it — the fix is the prompt or model. Track retrieval metrics (was the right context in the top-K?) independently from answer faithfulness, ideally via an eval set, so you don't spend weeks tuning the prompt when the document was never retrieved in the first place.

