LLM Architecture in Production: RAG, Vector Databases, and the 7-Point System-Design Checklist

Adding an LLM to your product is a distributed-systems problem with a non-deterministic dependency, not a single API call. This article covers when RAG actually helps (and when a prompt will do), how to think about vector databases and chunking without cargo-culting, the retrieval pipeline that separates demos from products, and the seven-point production checklist (evals, guardrails, cost ceilings, latency budgets, fallbacks, observability, and a human-in-the-loop boundary) to put in place before a real user touches it.

June 2, 2026 · 20 min read

Key Takeaway

Adding an LLM to your product is not "call an API and ship." It's a distributed system with a non-deterministic dependency that can be slow, expensive, wrong, and occasionally adversarial — and most teams discover this in production. This is the system-design playbook I use: when RAG actually helps (and when it's overkill), how to think about vector databases and chunking without cargo-culting, the retrieval pipeline that separates demos from products, and the seven-point checklist — evals, guardrails, cost ceilings, latency budgets, fallbacks, observability, and a human-in-the-loop boundary — that you need *before* the feature touches a real user. The model is the easy part. The system around it is the architecture.

January 2026. A team shows me their new "AI assistant" feature in a demo. It's beautiful. You type a question about your account, it answers in plain language, pulls in your real data, sounds human. Everyone in the room is impressed. Then I ask three questions.

"What happens when the model is down or rate-limited?" — Silence. "It just... errors?"

"What does this cost at 10,000 queries a day?" — Someone opens a calculator. The number makes the VP of Engineering wince.

"How do you know if it's getting worse after you change the prompt next week?" — "We... read some of the outputs?"

The demo worked. The system didn't exist yet. And that's the gap that sinks most LLM features: the model is a five-minute API call, but the architecture around it — retrieval, evaluation, guardrails, cost control, fallbacks, observability — is where the actual engineering lives. Teams spend 90% of their effort on the 10% that's easy.

I've now shipped and reviewed enough of these to have a repeatable approach. This is it: the design decisions in the order you should make them, the patterns that hold up under load, and the checklist that turns an impressive demo into something you can put your name on in production.

LLM architecture is the system design around a large language model — retrieval, evaluation, guardrails, cost control, latency, fallbacks, and observability — that turns a single API call into a reliable production feature. The model is one component; the architecture is what makes its output trustworthy, affordable, and safe at scale, and the dominant pattern for grounding it in your own data is RAG (retrieval-augmented generation): fetch relevant context at query time and feed it to the model.

First decision: do you even need RAG?

RAG — Retrieval-Augmented Generation — means fetching relevant context at query time and stuffing it into the prompt so the model answers from your data instead of its training data. It's the default architecture for "chat with your docs / your data" features. It's also reached for reflexively when simpler things would do.

Before you build a retrieval pipeline, walk down this ladder and stop at the first rung that solves your problem:

The point: RAG is the right answer to "large, changing knowledge, queried in unpredictable ways." If your knowledge fits in a prompt and rarely changes, you don't need a vector database — you need a system message. I've watched teams stand up Pinecone clusters to retrieve from a 12-page FAQ that would have fit in the prompt with room to spare.

Approach	Use when	Avoid when
Prompting	Knowledge fits the context window and rarely changes	Knowledge is large or changes often
RAG	Large, changing, proprietary knowledge queried unpredictably	A system prompt would already fit the data
Fine-tuning	You need consistent format, tone, or a specialized task style	You're trying to teach facts (use RAG)
Agents	Answering needs multi-step reasoning, tools, or actions	A single retrieve-then-answer pass is enough

And a myth worth killing: fine-tuning is not how you teach a model facts. Fine-tuning teaches format, tone, and behavior. Facts go in the context. Confusing the two leads to expensive training runs that hallucinate confidently in your brand voice.

Because "RAG vs fine-tuning" is the question I get asked most, here's the comparison in one place:

Criteria	RAG	Fine-Tuning
Best for	Injecting knowledge/facts	Changing behavior, format, tone
Data freshness	Live — update the index	Frozen into weights at train time
Citations	Yes — traceable to sources	No — facts baked in, unciteable
Update cost	Re-embed changed docs	Re-train the model
Hallucination control	High (grounded in retrieved context)	Low (recalls facts unreliably)
Upfront cost	Pipeline engineering	GPU/training runs
Typical use	Chat-with-your-docs, support, search	Structured output, domain style, classification

RAG and fine-tuning aren't rivals — production systems often fine-tune for format and use RAG for facts. The mistake is reaching for fine-tuning to solve a knowledge problem.

Production RAG is a pipeline, not a similarity search

If you do need RAG, the demo version — "embed the docs, do a similarity search, stuff the top 3 into the prompt" — will get you 70% quality and a plateau you can't climb past. Production retrieval is a pipeline with stages, and each stage is a place to win or lose quality.

These are the decisions that matter most, in order of impact. For a deeper treatment of each lever, see RAG in Production: Chunking, Re-ranking, and Hybrid Search.

1. Chunking is the highest-leverage and most-neglected decision. How you split documents determines what can possibly be retrieved. Chunk too small and you lose context; too large and you dilute relevance and waste tokens. Don't split on a fixed character count — split on structure (headings, sections, semantic boundaries) and overlap chunks slightly so an idea isn't cut in half. When I've turned around a "RAG that gives vague answers," the fix was chunking far more often than the model.

2. Retrieval quality > model quality, usually. A mediocre model with excellent retrieved context beats a frontier model fed garbage. Two cheap, high-impact upgrades over naive similarity search: query rewriting (expand or clarify the user's question before searching — "it broke" retrieves nothing useful) and re-ranking (over-fetch, say top-20, then use a cross-encoder re-ranker to pick the best 3-5). Re-ranking alone often moves answer quality more than upgrading the LLM.

3. Hybrid search beats pure vector search for most real corpora. Pure semantic similarity misses exact matches — product codes, error numbers, names, acronyms. Combine vector search with keyword/BM25 search and merge the results. Your users will paste exact strings, and pure embeddings will shrug.

4. Always return citations. Every claim the model makes should be traceable to a retrieved chunk. This is your single best defense against hallucination, your debugging tool when answers are wrong, and a trust signal for users. Architecturally: carry source IDs through the pipeline so the final answer can link back.
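To make two of those stages concrete, here is a minimal sketch, not a definitive implementation. It assumes documents use Markdown-style `#` headings, and it merges the vector and keyword rankings with reciprocal rank fusion, which is my choice of merge rule for illustration; the text above only says to merge. The names `Chunk`, `chunk_by_headings`, and `reciprocal_rank_fusion` are hypothetical. Each chunk carries a `source_id` so the final answer can cite it.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str  # carried through the pipeline so answers can cite it
    text: str


def chunk_by_headings(doc_id: str, text: str, overlap_sentences: int = 1) -> list[Chunk]:
    """Split on Markdown-style headings (an assumption about the corpus) and
    prepend the last sentence(s) of the previous section as overlap."""
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks: list[Chunk] = []
    previous_tail = ""
    for index, section in enumerate(sections):
        body = f"{previous_tail} {section}" if previous_tail else section
        chunks.append(Chunk(source_id=f"{doc_id}#{index}", text=body))
        sentences = re.split(r"(?<=[.!?])\s+", section)
        previous_tail = " ".join(sentences[-overlap_sentences:]) if overlap_sentences > 0 else ""
    return chunks


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of source IDs (e.g. one from vector search, one from BM25).
    Each list contributes 1 / (k + rank) per item; ties break on the ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, source_id in enumerate(ranking, start=1):
            scores[source_id] = scores.get(source_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda source_id: (-scores[source_id], source_id))
```

In a real pipeline the fused list would be the over-fetched candidate set handed to the re-ranker, and the surviving `source_id` values are what the answer links back to.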

Measuring retrieval: Precision@K and Recall@K

Measure retrieval before you measure answers. Two metrics tell you whether the generator even had a chance: Precision@K, the fraction of the top-K retrieved chunks that are actually relevant, and Recall@K, the fraction of all relevant chunks that made it into the top-K. If recall is low, no amount of prompt engineering will save the answer, because the right context never reached the model. When answers are wrong, look here first: many apparent generation problems are retrieval-quality problems in disguise. Track both against a labeled query set the same way you track evals.
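Both metrics are a few lines of code once you have labeled queries. A minimal sketch, assuming chunk IDs are strings and that Precision@K divides by K even when fewer than K chunks come back (conventions differ on that point); the function names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k slots filled with relevant chunks."""
    if k <= 0:
        return 0.0
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / k


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k."""
    if not relevant or k <= 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


# One labeled query: the retriever's ranking and the chunks a human marked relevant.
ranking = ["c1", "c7", "c3", "c9", "c4"]
labeled_relevant = {"c1", "c3", "c5"}
print(precision_at_k(ranking, labeled_relevant, k=5))  # 0.4
print(recall_at_k(ranking, labeled_relevant, k=5))
```

Here two of five retrieved chunks are relevant, and one relevant chunk (`c5`) never reached the model, which no prompt change can fix.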

Choosing (and not over-choosing) a vector database

The vector database question generates more anxious Slack threads than it deserves. Here's the honest framing.

Start with pgvector if you already run Postgres and have fewer than roughly a million vectors. It keeps your data in one place, supports transactional consistency with your metadata, and removes an entire system from your operational surface. The "you need a dedicated vector DB" advice is usually written by vendors of dedicated vector DBs.

Move to a dedicated vector database (Qdrant, Weaviate, Milvus, Pinecone) when you have real scale (many millions of vectors), need sophisticated metadata filtering at speed, or want features like hybrid search and quantization built in. The thing that actually matters across all of them is metadata filtering performance — in production you're almost never doing pure similarity search; you're doing "similar chunks where tenant = X and access_level <= Y and updated_at > Z." Benchmark filtered queries, not the clean ones in the marketing graphs.
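For a sense of what a filtered query looks like, here is a sketch that builds one for pgvector without executing it. The `chunks` table and its column names are hypothetical, `<=>` is pgvector's cosine-distance operator, and the `%(name)s` placeholders assume a driver that accepts that style, such as psycopg.

```python
from datetime import datetime


def to_vector_literal(embedding: list[float]) -> str:
    """Format floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_filtered_search(
    query_embedding: list[float],
    tenant_id: str,
    max_access_level: int,
    updated_after: datetime,
    k: int = 20,
) -> tuple[str, dict]:
    """Return (sql, params) for a similarity search restricted by metadata.
    Table and column names are placeholders for illustration."""
    sql = """
        SELECT id, source_id, content,
               embedding <=> %(query)s::vector AS distance
        FROM chunks
        WHERE tenant_id = %(tenant_id)s
          AND access_level <= %(max_access_level)s
          AND updated_at > %(updated_after)s
        ORDER BY embedding <=> %(query)s::vector
        LIMIT %(k)s
    """
    params = {
        "query": to_vector_literal(query_embedding),
        "tenant_id": tenant_id,
        "max_access_level": max_access_level,
        "updated_after": updated_after,
        "k": k,
    }
    return sql, params
```

This is the query shape worth benchmarking: the same statement with and without the `WHERE` clause can behave very differently once an approximate index is involved.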

Requirement	Recommended approach
< ~1M vectors, already run Postgres	`pgvector` — one less system to operate
Millions–billions of vectors	Dedicated vector DB (Qdrant, Weaviate, Milvus, Pinecone)
Heavy metadata filtering at speed	Dedicated DB; benchmark filtered queries, not clean ones
Built-in hybrid search + quantization	Dedicated DB (Qdrant/Weaviate)
Transactional consistency with app data	`pgvector`
No ops appetite	Managed service (Pinecone, managed Qdrant/Weaviate)
Full control / data residency	Self-hosted open-source (Qdrant, Weaviate, Milvus)

And don't forget the boring part: embeddings change. The day you switch embedding models, every vector is stale. Build re-embedding as a first-class, resumable batch job from day one, not a heroic weekend.
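One way to make re-embedding resumable is to record the embedding model version next to every vector, so a restarted job skips whatever is already done. A minimal sketch under that assumption; the in-memory `store` dict stands in for your vector table, and `embed` is whatever batch-embedding callable you use.

```python
from typing import Callable


def reembed(
    chunks: dict[str, str],
    embed: Callable[[list[str]], list[list[float]]],
    store: dict[str, dict],
    model_version: str,
    batch_size: int = 64,
) -> int:
    """Re-embed only chunks whose stored vector came from another model version.
    Progress is saved per chunk, so rerunning after a crash resumes the job."""
    pending = [
        chunk_id
        for chunk_id in sorted(chunks)
        if store.get(chunk_id, {}).get("model") != model_version
    ]
    written = 0
    for start in range(0, len(pending), batch_size):
        batch = pending[start:start + batch_size]
        vectors = embed([chunks[chunk_id] for chunk_id in batch])
        for chunk_id, vector in zip(batch, vectors):
            store[chunk_id] = {"model": model_version, "vector": vector}
            written += 1
    return written
```

Running it twice with the same `model_version` does no work the second time, which is the property you want when the job dies at 80%.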

Permission-aware retrieval and multi-tenant RAG

That `tenant = X and access_level <= Y` filter isn't only a performance concern; it's a security boundary. In multi-tenant RAG, a missing or mis-applied tenant filter leaks one customer's documents into another customer's answer, and the model will happily summarize data it should never have seen. The fix is permission-aware retrieval: enforce access control at the retrieval layer, as part of the query, not as a post-generation filter on the output. The model can't leak what was never retrieved. Treat every retrieval call as running under the requesting user's permissions, and test the negative case (a user cannot retrieve another tenant's chunks) as a first-class eval, not an afterthought.
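Here is a small in-memory sketch of that idea, with the negative case written as a check. The `StoredChunk` and `Principal` types are hypothetical; a real system would push the same filter into the vector store's query rather than filter in Python.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredChunk:
    chunk_id: str
    tenant_id: str
    access_level: int
    embedding: tuple[float, ...]


@dataclass(frozen=True)
class Principal:
    tenant_id: str
    access_level: int


def cosine(a: tuple[float, ...], b: tuple[float, ...]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(index: list[StoredChunk], principal: Principal,
             query: tuple[float, ...], k: int = 3) -> list[StoredChunk]:
    """Apply the permission filter first, then rank only what the caller may see."""
    visible = [
        chunk for chunk in index
        if chunk.tenant_id == principal.tenant_id
        and chunk.access_level <= principal.access_level
    ]
    return sorted(visible, key=lambda c: cosine(c.embedding, query), reverse=True)[:k]


index = [
    StoredChunk("a-1", "tenant-a", 1, (1.0, 0.0)),
    StoredChunk("a-2", "tenant-a", 5, (0.9, 0.1)),
    StoredChunk("b-1", "tenant-b", 1, (0.0, 1.0)),
]
# Negative case: the best match belongs to tenant-b, yet tenant-a must never see it.
results = retrieve(index, Principal("tenant-a", access_level=1), query=(0.0, 1.0))
assert all(chunk.tenant_id == "tenant-a" for chunk in results)
```

The query vector matches `b-1` exactly, so a post-hoc filter on the output would already be too late; filtering before ranking means that chunk never enters the prompt.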

The seven-point checklist I won't ship without

This is the part that doesn't make the demo and entirely determines whether the feature survives contact with users. I don't sign off on an LLM feature until all seven exist.

1. Evals — or you're flying blind

You cannot improve what you can't measure, and "we read some outputs" is not measurement. Before launch, build an eval set: a few dozen to a few hundred representative inputs with either gold answers or a scoring rubric. Run it on every prompt change, model swap, and retrieval tweak. This is the single practice that separates teams who improve their LLM feature from teams who change things and hope.

Three layers, cheapest first: assertion-based (does the output contain the citation? valid JSON? under length?), LLM-as-judge (a model scores relevance/faithfulness against a rubric — cheap, surprisingly good, validate it against humans periodically), and human review on a sample for the things only humans catch. Treat your eval set like a test suite: it's the regression net for a non-deterministic dependency. See the full eval-building guide for how to set one up without boiling the ocean.
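The cheapest layer is easy to start with. A minimal sketch of assertion-based checks, assuming the feature returns JSON with an `answer` field and cites sources as `[source:<id>]`; both are hypothetical conventions chosen for the example, not a standard.

```python
import json
import re
from typing import Callable

CITATION_PATTERN = re.compile(r"\[source:[^\]]+\]")  # hypothetical citation format


def assertion_checks(output: str, max_chars: int = 1200) -> dict[str, bool]:
    """Deterministic checks: valid JSON, contains a citation, under length."""
    try:
        payload = json.loads(output)
        valid_json = isinstance(payload, dict) and isinstance(payload.get("answer"), str)
    except json.JSONDecodeError:
        payload, valid_json = {}, False
    answer = payload["answer"] if valid_json else ""
    return {
        "valid_json": valid_json,
        "has_citation": bool(CITATION_PATTERN.search(answer)),
        "under_length": valid_json and len(answer) <= max_chars,
    }


def run_eval(cases: list[dict], generate: Callable[[str], str]) -> float:
    """Fraction of eval cases where every assertion passes."""
    if not cases:
        return 0.0
    passed = sum(all(assertion_checks(generate(case["input"])).values()) for case in cases)
    return passed / len(cases)
```

Wire `run_eval` into CI so the pass rate is recorded on every prompt or retrieval change; the LLM-as-judge and human-review layers then only need to look at what these checks cannot see.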

2. Guardrails — input and output

The model will be fed adversarial input (prompt injection, especially when retrieved content itself can carry instructions) and will occasionally produce output you can't ship (PII, toxic content, off-topic answers, leaked system prompt). Guard both ends: validate and constrain input, and screen output before it reaches the user. Treat the LLM as untrusted — never let its raw output trigger a privileged action without validation. If the model can call tools, the blast radius of a successful injection is whatever those tools can do. Scope them like you'd scope a service account.
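As a rough illustration of guarding both ends, the sketch below constrains input size and screens output for two of the failure modes named above: a leaked system prompt and PII (here, only email addresses). The regex, the size limit, and the `Verdict` type are assumptions for the example; real screening needs much broader coverage.

```python
import re
from dataclasses import dataclass

EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")  # deliberately simple
MAX_INPUT_CHARS = 4000  # hypothetical limit


@dataclass(frozen=True)
class Verdict:
    allowed: bool
    reason: str = ""


def check_input(user_text: str) -> Verdict:
    """Constrain input before it reaches the prompt."""
    if not user_text.strip():
        return Verdict(False, "empty input")
    if len(user_text) > MAX_INPUT_CHARS:
        return Verdict(False, "input too long")
    return Verdict(True)


def screen_output(output: str, system_prompt: str) -> Verdict:
    """Screen model output before it reaches the user or triggers anything."""
    if system_prompt and system_prompt in output:
        return Verdict(False, "system prompt leaked")
    if EMAIL_PATTERN.search(output):
        return Verdict(False, "possible PII (email address)")
    return Verdict(True)
```

The point of returning a `Verdict` rather than raising is that the caller decides what a blocked response becomes: a retry, a canned refusal, or a human review queue.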

3. A cost ceiling you control

Token costs scale with usage and, unlike most infra, with adversarial usage — someone can run up your bill on purpose. Set per-user and per-tenant rate limits and budget caps. Cache aggressively: identical and semantically-similar queries shouldn't pay twice (semantic caching can cut cost 30-50% on real traffic). Route by difficulty — a small cheap model for easy queries, the expensive one only when needed. I've seen a single unbounded "summarize everything" feature 5x an inference bill in a week.
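A sketch of how those three controls can sit in one wrapper, under some loud assumptions: the cache is exact-match after normalization (semantic caching would need an embedding lookup, not shown), the difficulty router is a toy word-count heuristic, and resetting the cap per day is left out. `call_model` is a hypothetical callable returning the text and its cost.

```python
import hashlib
from collections import defaultdict
from typing import Callable


class BudgetExceeded(Exception):
    """Raised when a tenant has used up its spending cap."""


def pick_model(prompt: str) -> str:
    # Hypothetical difficulty heuristic: long prompts go to the large model.
    return "large" if len(prompt.split()) > 50 else "small"


class CostGuard:
    def __init__(self, cap_usd: float) -> None:
        self.cap_usd = cap_usd
        self.spent: dict[str, float] = defaultdict(float)
        self.cache: dict[str, str] = {}

    def complete(self, tenant_id: str, prompt: str,
                 call_model: Callable[[str, str], tuple[str, float]]) -> str:
        normalized = " ".join(prompt.lower().split())
        # The tenant is part of the key so one tenant never reads another's cache.
        key = hashlib.sha256(f"{tenant_id}\x00{normalized}".encode()).hexdigest()
        if key in self.cache:
            return self.cache[key]  # cache hits cost nothing
        if self.spent[tenant_id] >= self.cap_usd:
            raise BudgetExceeded(tenant_id)
        text, cost_usd = call_model(pick_model(prompt), prompt)
        self.spent[tenant_id] += cost_usd
        self.cache[key] = text
        return text
```

Checking the cache before the budget is a design choice: a capped tenant can still get answers that cost nothing to serve.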

4. A latency budget and a streaming UX

LLM calls are slow — seconds, not milliseconds — and retrieval adds more. Decide your budget and design for it: stream tokens so perceived latency drops even when total time doesn't, run retrieval and other prep in parallel, and set timeouts with a real fallback. A spinner for eight seconds reads as broken; the same eight seconds streaming reads as thinking.
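The parallel-prep and timeout parts fit in a few lines of `asyncio`. This is a sketch with hypothetical coroutine parameters (`retrieve`, `load_profile`, `generate`); token streaming depends on your provider's SDK and is not shown.

```python
import asyncio
from typing import Awaitable, Callable


async def answer_with_budget(
    question: str,
    retrieve: Callable[[str], Awaitable[list[str]]],
    load_profile: Callable[[], Awaitable[dict]],
    generate: Callable[[str, list[str], dict], Awaitable[str]],
    budget_s: float = 8.0,
    fallback: str = "This is taking longer than expected. Please try again.",
) -> str:
    """Run prep steps concurrently and enforce one deadline on the whole request."""

    async def pipeline() -> str:
        # Retrieval and profile loading do not depend on each other, so overlap them.
        chunks, profile = await asyncio.gather(retrieve(question), load_profile())
        return await generate(question, chunks, profile)

    try:
        return await asyncio.wait_for(pipeline(), timeout=budget_s)
    except asyncio.TimeoutError:
        return fallback  # a real fallback, not a spinner that never resolves
```

Putting the deadline around the whole pipeline, rather than around each call, keeps the budget meaningful when a slow retrieval eats into the time left for generation.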

5. Fallbacks for when the model is unavailable

Your LLM provider will have outages, rate-limit you, and deprecate model versions on their schedule. Design for it: a secondary provider or model, a degraded-but-useful non-AI path (return the raw retrieved documents with a "summary unavailable" notice instead of a hard error), and circuit breakers so a provider incident doesn't cascade. Abstract the provider behind your own interface so swapping isn't a rewrite — this is hexagonal architecture applied to a flaky dependency.
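A compact sketch of that shape: a provider interface you own, a simple circuit breaker per provider, and a degraded path that returns the retrieved documents. The `Provider` protocol, the breaker thresholds, and the fallback wording are illustrative assumptions, and the breaker is simplified (no separate half-open state).

```python
import time
from typing import Callable, Protocol


class Provider(Protocol):
    def complete(self, prompt: str) -> str: ...


class CircuitBreaker:
    """Stop calling a provider after repeated failures, until a cooldown passes."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = 0.0

    def available(self) -> bool:
        if self.failures < self.max_failures:
            return True
        if self.clock() - self.opened_at >= self.cooldown_s:
            self.failures = 0  # cooldown over: allow traffic again
            return True
        return False

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()


def answer(prompt: str, providers: list[tuple[Provider, CircuitBreaker]],
           retrieved_docs: list[str]) -> str:
    """Try providers in order; if none works, degrade to the raw documents."""
    for provider, breaker in providers:
        if not breaker.available():
            continue
        try:
            result = provider.complete(prompt)
        except Exception:
            breaker.record(False)
            continue
        breaker.record(True)
        return result
    return "Summary unavailable. Relevant documents:\n" + "\n".join(retrieved_docs)
```

Because callers only see `answer`, swapping or reordering providers is a configuration change rather than a rewrite.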

6. Observability built for non-determinism

Standard APM doesn't capture what you need to debug an LLM feature. Log the full chain for every request: input, rewritten query, retrieved chunks with scores, final prompt, raw output, tokens, latency, and cost. When a user reports a bad answer, you need to replay exactly what happened — which chunk misled it, whether retrieval or generation failed. Without this, every bug report is unfalsifiable.
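One way to capture that chain is a single trace record per request, serialized as one JSON log line. The field names below mirror the list above but are otherwise my own; adapt them to whatever your logging pipeline expects.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RequestTrace:
    """Everything needed to replay one request end to end."""
    user_input: str
    rewritten_query: str = ""
    retrieved: list[dict] = field(default_factory=list)  # e.g. {"source_id": ..., "score": ...}
    final_prompt: str = ""
    raw_output: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


trace = RequestTrace(user_input="it broke")
trace.rewritten_query = "error when saving invoice"
trace.retrieved = [{"source_id": "kb-12#3", "score": 0.82}]
trace.raw_output = "Try clearing the draft [source:kb-12#3]."
print(trace.to_json())  # one structured log line per request
```

With the retrieved chunks and scores in the record, a bad answer can be traced to either the retrieval stage or the generation stage instead of argued about.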

7. A human-in-the-loop boundary

Decide explicitly where the AI is allowed to act versus only suggest. Drafting an email a human sends? Low stakes, let it run. Issuing a refund, changing a config, sending a message on the user's behalf? That needs a confirmation step. The boundary between "AI suggests, human approves" and "AI acts autonomously" is a deliberate architectural decision tied to blast radius — make it on purpose, not by default.
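Making the boundary explicit can be as plain as a policy table consulted before anything executes. A sketch with hypothetical action names taken from the examples above; unknown actions default to requiring confirmation.

```python
from enum import Enum
from typing import Callable, Optional


class Mode(Enum):
    AUTONOMOUS = "autonomous"
    NEEDS_CONFIRMATION = "needs_confirmation"


# The boundary, written down on purpose. Action names are illustrative.
POLICY: dict[str, Mode] = {
    "draft_email": Mode.AUTONOMOUS,
    "issue_refund": Mode.NEEDS_CONFIRMATION,
    "change_config": Mode.NEEDS_CONFIRMATION,
    "send_message": Mode.NEEDS_CONFIRMATION,
}


def execute(action: str, handler: Callable[[], str],
            confirmed_by: Optional[str] = None) -> dict:
    """Run the handler only if policy allows it or a human has confirmed."""
    mode = POLICY.get(action, Mode.NEEDS_CONFIRMATION)  # unlisted actions are not trusted
    if mode is Mode.NEEDS_CONFIRMATION and confirmed_by is None:
        return {"status": "pending_confirmation", "action": action}
    return {"status": "done", "action": action, "result": handler(),
            "confirmed_by": confirmed_by}
```

The useful property is that adding a new action without deciding its mode leaves it behind the confirmation step by default.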

When the LLM feature becomes an agent: tool calling and MCP

When answering needs actions, not just a retrieved paragraph, plain RAG becomes an agentic workflow: the model plans, calls tools — search, internal APIs, database queries — and loops on the results until it has an answer. Tool calling is what turns the LLM from a text generator into an actor in your system, and it's also where blast radius explodes. The Model Context Protocol (MCP) standardizes how models discover and invoke tools across providers, so the same agent can reach a filesystem, a ticketing API, and your search index through one interface.

The architectural rule doesn't change — it intensifies. Every tool the model can call, and every MCP server it can reach, is part of its attack surface. A successful prompt injection inherits exactly the permissions of the tools you exposed, so scope each one like a least-privilege service account, log every call, and keep the human-in-the-loop boundary (checklist item 7) firmly in front of any tool that writes, sends, or spends. An agent is a more capable architecture and a larger liability in the same step.
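To show the scoping rule in code, here is a toy agent loop, not tied to MCP or any SDK. The planner is a plain callable standing in for the model, `ScopedTools` is a hypothetical wrapper that exposes only an allowlisted subset of tools, and every attempted call is logged, including denied ones.

```python
from typing import Callable


class ToolNotPermitted(Exception):
    pass


class ScopedTools:
    """Expose only allowlisted tools to the agent and log every attempted call."""

    def __init__(self, tools: dict[str, Callable[..., str]], allowed: set[str]) -> None:
        self._tools = {name: fn for name, fn in tools.items() if name in allowed}
        self.call_log: list[tuple[str, dict]] = []

    def call(self, name: str, **kwargs) -> str:
        self.call_log.append((name, kwargs))
        if name not in self._tools:
            raise ToolNotPermitted(name)
        return self._tools[name](**kwargs)


def run_agent(plan_next: Callable[[list[str]], dict], tools: ScopedTools,
              max_steps: int = 5) -> str:
    """Plan, call a tool, observe, repeat; stop at a final answer or the step cap."""
    observations: list[str] = []
    for _ in range(max_steps):
        step = plan_next(observations)
        if step["type"] == "final":
            return step["answer"]
        try:
            observations.append(tools.call(step["tool"], **step["args"]))
        except ToolNotPermitted as exc:
            observations.append(f"denied: {exc}")
    return "stopped: step limit reached"
```

If an injected instruction convinces the planner to request a tool outside the allowlist, the call is denied and logged, and the step cap keeps a confused loop from running forever.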

A reference architecture

Putting it together, here's the shape of a production LLM feature:

Notice how little of this is "the model." The model is one box. The architecture is everything around it — and that's where your engineering judgment shows up.

What to do Monday morning

If you have an LLM feature in production or in progress:

1. Run the RAG ladder against it. Are you using retrieval where a system prompt would do? Are you fine-tuning to inject facts? Cut what you don't need.
2. Build a 30-example eval set this week. Thirty real inputs with expected qualities. Run it once to get a baseline. This one artifact will change how your team works on the feature: you'll stop guessing.
3. Audit the seven-point checklist. Score your feature 0-7. Most pre-launch features I review score 2-3 (they have a happy path and maybe rate limiting). Every missing point is a production incident waiting for a date.
4. Add full-chain logging if you don't have it. You cannot debug what you can't replay. This is the prerequisite for everything else.

Take your current or planned LLM feature and write its reference architecture as a single diagram — every box between the user and the model and back. If any of the seven checklist items has no box, you've just found your next sprint. The teams that win with AI aren't the ones with access to the best model; everyone has that now. They're the ones who built the system around it.

The model writes the answer. You design the system that makes the answer trustworthy. That's still the job.

Frequently asked questions

When should I use RAG versus fine-tuning an LLM?

Use RAG when the model needs access to large, changing, or proprietary knowledge that it should answer from — documentation, account data, a knowledge base. Use fine-tuning to change the model's behavior: output format, tone, adherence to a structure, or a specialized task style. They solve different problems and are often combined. The common, expensive mistake is fine-tuning to teach facts — facts belong in the retrieved context, because fine-tuning bakes information into weights where it can't be updated or cited and tends to be recalled unreliably.

Do I need a dedicated vector database for RAG?

Not at first. If you already run PostgreSQL and have under roughly a million vectors, pgvector keeps everything in one system with transactional consistency and is enough for most applications. Move to a dedicated vector database (Qdrant, Weaviate, Milvus, Pinecone) when you reach many millions of vectors, need high-performance metadata filtering, or want built-in hybrid search and quantization. When you evaluate them, benchmark filtered queries ("similar chunks where tenant = X") rather than pure similarity search, because filtered queries are what production workloads actually run.

Why are my RAG answers vague or wrong even with a good model?

The most common cause is retrieval, not the model. Check chunking first (split on document structure with slight overlap rather than fixed character counts), add query rewriting so vague user phrasing maps to useful searches, and add a re-ranking step that over-fetches candidates and selects the best few. Use hybrid (vector + keyword) search so exact terms like error codes and product names match. A mediocre model with excellent retrieved context reliably beats a frontier model fed poor context.

How do I control the cost of an LLM feature in production?

Set per-user and per-tenant rate limits and budget caps so usage — including adversarial usage — can't run unbounded. Add caching, including semantic caching for similar queries, which commonly cuts costs 30-50%. Route by difficulty so a small cheap model handles easy requests and the expensive model is reserved for hard ones. Track cost per request in your observability so a single feature can't silently multiply your bill.

How do I prevent prompt injection in an LLM feature?

Treat the model and its inputs as untrusted. Validate and constrain user input, and critically, treat retrieved content as a potential injection vector since it can contain instructions. Screen model output before acting on it, and never let raw model output trigger a privileged operation without validation. If the model can invoke tools, scope those tools tightly — their permissions define the blast radius of a successful injection, so treat them like a service account with least privilege.

What's the minimum I need before launching an LLM feature?

At minimum: an eval set to detect regressions, input and output guardrails, a cost ceiling with rate limits, a latency budget with streaming and timeouts, a fallback for provider outages, full-chain observability so you can replay any request, and an explicit decision about where the AI acts versus only suggests. Most pre-launch features have only a happy path and maybe rate limiting — each missing item is a likely production incident.

What is LLM architecture?

LLM architecture is the system design around a large language model — retrieval, evals, guardrails, cost control, latency budgets, fallbacks, and observability — that turns a single model API call into a reliable production feature. The model itself is one component; the architecture is what makes its output trustworthy, affordable, and safe at scale. Most of the engineering in a production LLM feature lives in this surrounding system, not the model call.

What is RAG architecture?

RAG (retrieval-augmented generation) architecture is a two-phase design. Offline, documents are chunked, embedded, and stored in a vector database. Online, a user query is rewritten, matched against that store, re-ranked, and the best results are assembled into the prompt the LLM answers from — with citations back to the source chunks. It grounds answers in your current, proprietary data instead of the model's training data.

What is the difference between RAG and an AI agent?

RAG retrieves context and answers in a single pass. An agent plans, calls tools — search, APIs, databases — often through a protocol like MCP, and loops on the results to take actions rather than just answer. Agents add capability and blast radius at the same time: scope their tools like a least-privilege service account and keep a human-in-the-loop boundary in front of anything that writes, sends, or spends.

How do production AI systems prevent hallucinations?

They ground answers in retrieved context (RAG) rather than the model's memory, return citations so every claim is traceable to a source, and measure groundedness and faithfulness in their eval set. Retrieval quality is the lever that matters most: re-ranking, hybrid (vector + keyword) search, and good chunking get the right context in front of the model, and output guardrails screen anything unsupported before it reaches the user.

Runnable reference: The patterns in this article — provider abstraction, cost tracking, rate limiting, and fallback — are implemented in packages/ai-gateway inside the ai-native-app-blueprint repo. Clone it, read the ADR, run the tests.

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

Continue Reading

Retrieval Failure vs Generation Failure: How to Diagnose Which Layer Is Killing Your RAG System

RAG in Production: Chunking, Re-ranking, and Hybrid Search (The Deep Dive)

Evals for LLM Features: Building the Regression Net for a Non-Deterministic Dependency