
Designing LLM-Powered Features: RAG, Vector Databases, and the New System-Design Checklist

Adding an LLM to your product is a distributed-systems problem with a non-deterministic dependency, not a single API call. When RAG actually helps (and when a prompt will do), how to think about vector databases and chunking without cargo-culting, the retrieval pipeline that separates demos from products, and the seven-point production checklist — evals, guardrails, cost ceilings, latency budgets, fallbacks, observability, and a human-in-the-loop boundary — to put in place before a real user touches it.

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

Key Takeaway

Adding an LLM to your product is not "call an API and ship." It's a distributed system with a non-deterministic dependency that can be slow, expensive, wrong, and occasionally adversarial — and most teams discover this in production. This is the system-design playbook I use: when RAG actually helps (and when it's overkill), how to think about vector databases and chunking without cargo-culting, the retrieval pipeline that separates demos from products, and the seven-point checklist — evals, guardrails, cost ceilings, latency budgets, fallbacks, observability, and a human-in-the-loop boundary — that you need *before* the feature touches a real user. The model is the easy part. The system around it is the architecture.


January 2026. A team shows me their new "AI assistant" feature in a demo. It's beautiful. You type a question about your account, it answers in plain language, pulls in your real data, sounds human. Everyone in the room is impressed. Then I ask three questions.

"What happens when the model is down or rate-limited?" — Silence. "It just... errors?"

"What does this cost at 10,000 queries a day?" — Someone opens a calculator. The number makes the VP of Engineering wince.

"How do you know if it's getting worse after you change the prompt next week?" — "We... read some of the outputs?"

The demo worked. The system didn't exist yet. And that's the gap that sinks most LLM features: the model is a five-minute API call, but the architecture around it — retrieval, evaluation, guardrails, cost control, fallbacks, observability — is where the actual engineering lives. Teams spend 90% of their effort on the 10% that's easy.

I've now shipped and reviewed enough of these to have a repeatable approach. This is it: the design decisions in the order you should make them, the patterns that hold up under load, and the checklist that turns an impressive demo into something you can put your name on in production.

First decision: do you even need RAG?

RAG — Retrieval-Augmented Generation — means fetching relevant context at query time and stuffing it into the prompt so the model answers from your data instead of its training data. It's the default architecture for "chat with your docs / your data" features. It's also reached for reflexively when simpler things would do.

Before you build a retrieval pipeline, walk down this ladder (prompt → cache → RAG → RAG + tools) and stop at the first rung that solves your problem.

The point: RAG is the right answer to "large, changing knowledge, queried in unpredictable ways." If your knowledge fits in a prompt and rarely changes, you don't need a vector database — you need a system message. I've watched teams stand up Pinecone clusters to retrieve from a 12-page FAQ that would have fit in the prompt with room to spare.

And a myth worth killing: fine-tuning is not how you teach a model facts. Fine-tuning teaches format, tone, and behavior. Facts go in the context. Confusing the two leads to expensive training runs that hallucinate confidently in your brand voice.

The retrieval pipeline that actually works

If you do need RAG, the demo version — "embed the docs, do a similarity search, stuff the top 3 into the prompt" — will get you 70% quality and a plateau you can't climb past. Production retrieval is a pipeline with stages, and each stage is a place to win or lose quality.

The decisions that matter most, in order of impact:

1. Chunking is the highest-leverage and most-neglected decision. How you split documents determines what can possibly be retrieved. Chunk too small and you lose context; too large and you dilute relevance and waste tokens. Don't split on a fixed character count — split on structure (headings, sections, semantic boundaries) and overlap chunks slightly so an idea isn't cut in half. When I've turned around a "RAG that gives vague answers," the fix was chunking far more often than the model.

2. Retrieval quality > model quality, usually. A mediocre model with excellent retrieved context beats a frontier model fed garbage. Two cheap, high-impact upgrades over naive similarity search: query rewriting (expand or clarify the user's question before searching — "it broke" retrieves nothing useful) and re-ranking (over-fetch, say top-20, then use a cross-encoder re-ranker to pick the best 3-5). Re-ranking alone often moves answer quality more than upgrading the LLM.

3. Hybrid search beats pure vector search for most real corpora. Pure semantic similarity misses exact matches — product codes, error numbers, names, acronyms. Combine vector search with keyword/BM25 search and merge the results. Your users will paste exact strings, and pure embeddings will shrug.

4. Always return citations. Every claim the model makes should be traceable to a retrieved chunk. This is your single best defense against hallucination, your debugging tool when answers are wrong, and a trust signal for users. Architecturally: carry source IDs through the pipeline so the final answer can link back.
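To make the chunking advice in point 1 concrete, here is a minimal sketch of structure-based splitting with a small overlap. It assumes the documents use markdown-style `#` headings, which may not match your corpus. The names `Chunk` and `chunk_by_headings` and the `doc#index` source ID format are illustrative, not taken from any library. Each chunk keeps a source ID so that citations (point 4) can link back to it.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source_id: str  # carried through the pipeline so answers can cite it
    heading: str
    text: str


def chunk_by_headings(doc_id: str, text: str, overlap_sentences: int = 1) -> list:
    """Split on headings and prepend the tail of the previous section as overlap."""
    # Assumption: a section starts at any line beginning with 1-6 '#' and a space.
    sections = [s.strip() for s in re.split(r"(?m)^(?=#{1,6} )", text) if s.strip()]
    chunks = []
    previous_tail = ""
    for index, section in enumerate(sections):
        if section.startswith("#"):
            heading_line, _, body = section.partition("\n")
            heading = heading_line.lstrip("# ").strip()
        else:
            heading, body = "", section  # preamble before the first heading
        body = body.strip()
        chunk_text = f"{previous_tail} {body}".strip() if previous_tail else body
        chunks.append(Chunk(f"{doc_id}#{index}", heading, chunk_text))
        # Keep the last sentence(s) of this section as overlap for the next chunk.
        sentences = re.split(r"(?<=[.!?])\s+", body)
        previous_tail = " ".join(sentences[-overlap_sentences:]) if overlap_sentences else ""
    return chunks
```

A real splitter would also cap chunk size and handle tables and code blocks. The point here is only that the split follows document structure rather than a character count.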

Choosing (and not over-choosing) a vector database

The vector database question generates more anxious Slack threads than it deserves. Here's the honest framing.

Start with pgvector if you already run Postgres and have fewer than roughly a million vectors. It keeps your data in one place, supports transactional consistency with your metadata, and removes an entire system from your operational surface. The "you need a dedicated vector DB" advice is usually written by vendors of dedicated vector DBs.

Move to a dedicated vector database (Qdrant, Weaviate, Milvus, Pinecone) when you have real scale (many millions of vectors), need sophisticated metadata filtering at speed, or want features like hybrid search and quantization built in. The thing that actually matters across all of them is metadata filtering performance — in production you're almost never doing pure similarity search; you're doing "similar chunks where tenant = X and access_level <= Y and updated_at > Z." Benchmark filtered queries, not the clean ones in the marketing graphs.
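As a rough illustration of what a filtered query looks like on pgvector, the sketch below builds a parameterized SQL statement. The `chunks` table and its columns are a hypothetical schema, and the `%s` placeholders assume a psycopg-style driver. `<=>` is pgvector's cosine-distance operator. This is the query shape worth benchmarking, not a tuned implementation.

```python
def build_filtered_search(query_embedding, tenant_id, max_access_level, updated_after, limit=20):
    """Return (sql, params) for a metadata-filtered nearest-neighbour query."""
    # Hypothetical schema:
    #   chunks(id, tenant_id, access_level, updated_at, content, embedding vector)
    sql = (
        "SELECT id, content, embedding <=> %s::vector AS distance "
        "FROM chunks "
        "WHERE tenant_id = %s AND access_level <= %s AND updated_at > %s "
        "ORDER BY distance "
        "LIMIT %s"
    )
    # pgvector accepts a text literal such as '[0.1,0.2,0.3]' cast to vector.
    vector_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return sql, (vector_literal, tenant_id, max_access_level, updated_after, limit)
```

Running this with `EXPLAIN ANALYZE` against realistic tenant sizes should tell you more than any unfiltered benchmark.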

And don't forget the boring part: embeddings change. The day you switch embedding models, every vector is stale. Build re-embedding as a first-class, resumable batch job from day one, not a heroic weekend.
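One way to make re-embedding resumable is to process chunks in a stable order and write a checkpoint after every batch. The sketch below assumes three callables that you supply (`load_text`, `embed_batch`, `save_vectors`); they are hypothetical hooks, not a real library API, and `save_vectors` is assumed to be an idempotent upsert.

```python
import json
from pathlib import Path
from typing import Callable, Sequence


def reembed_all(
    chunk_ids: Sequence[str],
    load_text: Callable[[str], str],
    embed_batch: Callable[[list], list],
    save_vectors: Callable[[list, list], None],
    checkpoint: Path,
    batch_size: int = 64,
) -> int:
    """Re-embed chunks in order, resuming from the last completed batch."""
    start = 0
    if checkpoint.exists():
        start = json.loads(checkpoint.read_text())["next_index"]
    processed = 0
    for offset in range(start, len(chunk_ids), batch_size):
        batch = list(chunk_ids[offset:offset + batch_size])
        vectors = embed_batch([load_text(chunk_id) for chunk_id in batch])
        save_vectors(batch, vectors)  # assumed idempotent, so a retried batch is safe
        # Only advance the checkpoint after the batch is durably saved.
        checkpoint.write_text(json.dumps({"next_index": offset + len(batch)}))
        processed += len(batch)
    return processed
```

If the job dies mid-run, starting it again picks up at the first unsaved batch instead of paying for every embedding twice.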

The seven-point production checklist

This is the part that doesn't make the demo and entirely determines whether the feature survives contact with users. I don't sign off on an LLM feature until all seven exist.

1. Evals — or you're flying blind

You cannot improve what you can't measure, and "we read some outputs" is not measurement. Before launch, build an eval set: a few dozen to a few hundred representative inputs with either gold answers or a scoring rubric. Run it on every prompt change, model swap, and retrieval tweak. This is the single practice that separates teams who improve their LLM feature from teams who change things and hope.

Three layers, cheapest first: assertion-based (does the output contain the citation? valid JSON? under length?), LLM-as-judge (a model scores relevance/faithfulness against a rubric — cheap, surprisingly good, validate it against humans periodically), and human review on a sample for the things only humans catch. Treat your eval set like a test suite: it's the regression net for a non-deterministic dependency.
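The cheapest layer, assertion-based checks, can be a few lines of code. This sketch assumes answers cite sources in a `[source:...]` format, which is an invented convention for illustration. `EvalCase`, `check_output`, and `run_evals` are hypothetical names, and `answer_fn` stands in for your feature's entry point.

```python
import re
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    question: str
    must_mention: list = field(default_factory=list)
    max_chars: int = 1200


def check_output(case: EvalCase, output: str) -> list:
    """Return the list of failed assertions for one output (empty means pass)."""
    failures = []
    if not re.search(r"\[source:[^\]]+\]", output):  # assumed citation format
        failures.append("missing citation")
    if len(output) > case.max_chars:
        failures.append("too long")
    for term in case.must_mention:
        if term.lower() not in output.lower():
            failures.append(f"missing term: {term}")
    return failures


def run_evals(cases, answer_fn) -> float:
    """Run every case through the feature and return the pass rate."""
    passed = sum(1 for case in cases if not check_output(case, answer_fn(case.question)))
    return passed / len(cases)
```

Tracking the pass rate per commit gives you a regression signal. The LLM-as-judge and human layers would sit on top of this for the qualities that string checks cannot see.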

2. Guardrails — input and output

The model will be fed adversarial input (prompt injection, especially when retrieved content itself can carry instructions) and will occasionally produce output you can't ship (PII, toxic content, off-topic answers, leaked system prompt). Guard both ends: validate and constrain input, and screen output before it reaches the user. Treat the LLM as untrusted — never let its raw output trigger a privileged action without validation. If the model can call tools, the blast radius of a successful injection is whatever those tools can do. Scope them like you'd scope a service account.
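A sketch of what "never let raw output trigger a privileged action without validation" can look like for tool calls: an allowlist with per-tool argument and limit checks, applied before anything executes. The tool names, the `ToolPolicy` type, and the 50.0 limit are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolPolicy:
    allowed_args: frozenset
    max_amount: Optional[float] = None  # None means the tool takes no amount


# Hypothetical tools; anything not listed here is rejected.
POLICIES = {
    "lookup_order": ToolPolicy(frozenset({"order_id"})),
    "issue_credit": ToolPolicy(frozenset({"order_id", "amount"}), max_amount=50.0),
}


def validate_tool_call(name: str, args: dict) -> tuple:
    """Check a model-proposed tool call against the allowlist. Returns (ok, reason)."""
    policy = POLICIES.get(name)
    if policy is None:
        return False, f"tool not allowlisted: {name}"
    unexpected = set(args) - policy.allowed_args
    if unexpected:
        return False, f"unexpected arguments: {sorted(unexpected)}"
    if policy.max_amount is not None:
        amount = args.get("amount")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or not 0 < amount <= policy.max_amount:
            return False, "amount missing or outside allowed range"
    return True, "ok"
```

The validation runs in your code, outside the model, so a successful injection can at most request what the policy already permits.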

3. A cost ceiling you control

Token costs scale with usage and, unlike most infra, with adversarial usage: someone can run up your bill on purpose. Set per-user and per-tenant rate limits and budget caps. Cache aggressively: identical and semantically similar queries shouldn't pay twice (semantic caching can cut cost 30-50% on real traffic). Route by difficulty: a small cheap model for easy queries, the expensive one only when needed. I've seen a single unbounded "summarize everything" feature multiply an inference bill fivefold in a week.
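A minimal sketch of a per-user budget cap and a difficulty router. The in-memory dictionary, the model names, and the routing heuristic are placeholders. A real deployment would presumably keep spend in a shared store, reset it daily, and route on something better than question length.

```python
from collections import defaultdict


class BudgetGuard:
    """Per-user spend cap. In-memory only; an assumption made for brevity."""

    def __init__(self, daily_cap_usd: float):
        self.daily_cap_usd = daily_cap_usd
        self.spent = defaultdict(float)  # user_id -> USD spent today

    def allow(self, user_id: str, estimated_cost_usd: float) -> bool:
        # Check before calling the model, using an estimate from token counts.
        return self.spent[user_id] + estimated_cost_usd <= self.daily_cap_usd

    def record(self, user_id: str, actual_cost_usd: float) -> None:
        self.spent[user_id] += actual_cost_usd


def pick_model(question: str) -> str:
    """Crude difficulty routing: long or multi-part questions go to the large model."""
    hard = len(question) > 400 or question.count("?") > 1
    return "large-model" if hard else "small-model"  # hypothetical model names
```

Checking the budget before the call and recording actual cost after it keeps the ceiling under your control rather than your provider's invoice.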

4. A latency budget and a streaming UX

LLM calls are slow — seconds, not milliseconds — and retrieval adds more. Decide your budget and design for it: stream tokens so perceived latency drops even when total time doesn't, run retrieval and other prep in parallel, and set timeouts with a real fallback. A spinner for eight seconds reads as broken; the same eight seconds streaming reads as thinking.
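The parallel-prep and timeout parts of that budget can be sketched with `asyncio`. The `retrieve`, `load_profile`, and `generate` coroutines below are stand-ins that only sleep; the budget values are arbitrary. Token streaming is not shown, since it depends on your provider's client.

```python
import asyncio


async def retrieve(query: str) -> list:
    await asyncio.sleep(0.01)  # stand-in for a vector search
    return ["chunk-a", "chunk-b"]


async def load_profile(user_id: str) -> dict:
    await asyncio.sleep(0.01)  # stand-in for a database lookup
    return {"user_id": user_id, "plan": "pro"}


async def generate(query: str, chunks: list, profile: dict, delay_s: float) -> str:
    await asyncio.sleep(delay_s)  # stand-in for the model call
    return f"answer using {len(chunks)} chunks"


async def answer(query: str, user_id: str, budget_s: float = 0.05, model_delay_s: float = 0.0) -> str:
    # Retrieval and other prep run concurrently instead of back to back.
    chunks, profile = await asyncio.gather(retrieve(query), load_profile(user_id))
    try:
        return await asyncio.wait_for(
            generate(query, chunks, profile, model_delay_s), timeout=budget_s
        )
    except asyncio.TimeoutError:
        # A real fallback: hand back what retrieval found rather than an error.
        return "Summary unavailable. Top sources: " + ", ".join(chunks)
```

The timeout only covers generation here. You may want a separate, tighter budget on retrieval.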

5. Fallbacks for when the model is unavailable

Your LLM provider will have outages, rate-limit you, and deprecate model versions on their schedule. Design for it: a secondary provider or model, a degraded-but-useful non-AI path (return the raw retrieved documents with a "summary unavailable" notice instead of a hard error), and circuit breakers so a provider incident doesn't cascade. Abstract the provider behind your own interface so swapping isn't a rewrite — this is hexagonal architecture applied to a flaky dependency.
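Here is one possible shape for the fallback chain: providers sit behind plain callables, each with a simple circuit breaker, and the last resort is the degraded non-AI path. The breaker is deliberately simplified (it resets fully after the cooldown instead of modeling a half-open state), and all names are illustrative.

```python
import time
from typing import Optional


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            # Simplification: close fully after the cooldown and try again.
            self.opened_at, self.failures = None, 0
            return False
        return True

    def record(self, ok: bool) -> None:
        if ok:
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()


def answer_with_fallback(prompt: str, sources: list, providers: list) -> str:
    """providers is an ordered list of (call, breaker) pairs; call(prompt) returns text."""
    for call, breaker in providers:
        if breaker.is_open():
            continue  # skip a provider that is known to be failing
        try:
            result = call(prompt)
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
    # Degraded but useful: return the retrieved documents without a summary.
    return "Summary unavailable. Relevant documents: " + "; ".join(sources)
```

Because the providers are plain callables behind your own interface, swapping or reordering them is a configuration change.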

6. Observability built for non-determinism

Standard APM doesn't capture what you need to debug an LLM feature. Log the full chain for every request: input, rewritten query, retrieved chunks with scores, final prompt, raw output, tokens, latency, and cost. When a user reports a bad answer, you need to replay exactly what happened — which chunk misled it, whether retrieval or generation failed. Without this, every bug report is unfalsifiable.
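The full chain listed above maps naturally onto one structured record per request. The field names below are my own choices, not a standard schema, and a real system would likely redact sensitive input before writing the record.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class LLMTrace:
    user_input: str
    rewritten_query: str = ""
    retrieved: list = field(default_factory=list)  # e.g. {"source_id": ..., "score": ...}
    final_prompt: str = ""
    raw_output: str = ""
    prompt_tokens: int = 0
    completion_tokens: int = 0
    latency_ms: float = 0.0
    cost_usd: float = 0.0
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: float = field(default_factory=time.time)

    def to_json(self) -> str:
        """Serialize as one JSON line, ready for a log pipeline."""
        return json.dumps(asdict(self), sort_keys=True)
```

Fill the record as the request moves through each stage and emit it once at the end. Returning the `trace_id` to the client lets a bad-answer report point at the exact request.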

7. A human-in-the-loop boundary

Decide explicitly where the AI is allowed to act versus only suggest. Drafting an email a human sends? Low stakes, let it run. Issuing a refund, changing a config, sending a message on the user's behalf? That needs a confirmation step. The boundary between "AI suggests, human approves" and "AI acts autonomously" is a deliberate architectural decision tied to blast radius — make it on purpose, not by default.
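One way to make that boundary explicit is a table in code that says which actions may run on their own and which wait for a person. The action names mirror the examples above; `Mode`, `ACTION_MODES`, and `execute` are hypothetical, and unknown actions default to requiring confirmation.

```python
from enum import Enum


class Mode(Enum):
    AUTO = "auto"        # the AI may act on its own
    CONFIRM = "confirm"  # a human must approve first


ACTION_MODES = {
    "draft_email": Mode.AUTO,
    "issue_refund": Mode.CONFIRM,
    "change_config": Mode.CONFIRM,
    "send_message": Mode.CONFIRM,
}


def execute(action: str, run, confirmed_by=None) -> dict:
    """Run the action only if policy allows it or a human has confirmed it."""
    mode = ACTION_MODES.get(action, Mode.CONFIRM)  # unknown actions need approval
    if mode is Mode.CONFIRM and confirmed_by is None:
        return {"status": "pending_confirmation", "action": action}
    return {"status": "done", "action": action, "result": run(), "confirmed_by": confirmed_by}
```

Keeping the table in one place means moving an action across the boundary is a reviewed change, not an accident of how a prompt was written.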

A reference architecture

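Putting it together, the shape of a production LLM feature is roughly this: rate limits, budget caps, and caching in front; input guardrails; query rewriting; hybrid retrieval and re-ranking; prompt assembly with source IDs; the model call behind a provider abstraction with fallbacks; output guardrails; and a confirmation step before any privileged action, with full-chain logging and evals across the whole path.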

Notice how little of this is "the model." The model is one box. The architecture is everything around it — and that's where your engineering judgment shows up.

What to do Monday morning

If you have an LLM feature in production or in progress:

  1. Run the RAG ladder against it. Are you using retrieval where a system prompt would do? Are you fine-tuning to inject facts? Cut what you don't need.

  2. Build a 30-example eval set this week. Thirty real inputs with expected qualities. Run it once to get a baseline. This one artifact will change how your team works on the feature — you'll stop guessing.

  3. Audit the seven-point checklist. Score your feature 0-7. Most pre-launch features I review score 2-3 (they have a happy path and maybe rate limiting). Every missing point is a production incident waiting for a date.

  4. Add full-chain logging if you don't have it. You cannot debug what you can't replay. This is the prerequisite for everything else.

Key takeaways

  • The model is the easy 10%; the system is the hard 90%. Retrieval, evals, guardrails, cost control, latency, fallbacks, and observability are the actual engineering. Teams that invert this ratio ship demos, not products.

  • RAG is for large, changing, unpredictably-queried knowledge — not a reflex. Walk the ladder: prompt → cache → RAG → RAG+tools. And remember fine-tuning teaches behavior, not facts.

  • Retrieval quality usually beats model quality. Chunk on structure, rewrite queries, re-rank an over-fetched set, use hybrid search, and always return citations. These beat upgrading the LLM more often than not.

  • Don't over-choose your vector database. Start with pgvector if you already run Postgres and are under ~1M vectors. Move to a dedicated DB for real scale — and benchmark filtered queries, because that's what production actually does.

  • Treat the LLM as an untrusted, flaky dependency. Guard both ends, scope its tools by blast radius, give it fallbacks and a cost ceiling, and decide deliberately where it acts versus suggests.

Your next step

Take your current or planned LLM feature and write its reference architecture as a single diagram — every box between the user and the model and back. If any of the seven checklist items has no box, you've just found your next sprint. The teams that win with AI aren't the ones with access to the best model; everyone has that now. They're the ones who built the system around it.

The model writes the answer. You design the system that makes the answer trustworthy. That's still the job.

Frequently asked questions

When should I use RAG versus fine-tuning an LLM?

Use RAG when the model needs access to large, changing, or proprietary knowledge that it should answer from — documentation, account data, a knowledge base. Use fine-tuning to change the model's behavior: output format, tone, adherence to a structure, or a specialized task style. They solve different problems and are often combined. The common, expensive mistake is fine-tuning to teach facts — facts belong in the retrieved context, because fine-tuning bakes information into weights where it can't be updated or cited and tends to be recalled unreliably.

Do I need a dedicated vector database for RAG?

Not at first. If you already run PostgreSQL and have under roughly a million vectors, pgvector keeps everything in one system with transactional consistency and is enough for most applications. Move to a dedicated vector database (Qdrant, Weaviate, Milvus, Pinecone) when you reach many millions of vectors, need high-performance metadata filtering, or want built-in hybrid search and quantization. When you evaluate them, benchmark filtered queries ("similar chunks where tenant = X") rather than pure similarity search, because filtered queries are what production workloads actually run.

Why are my RAG answers vague or wrong even with a good model?

The most common cause is retrieval, not the model. Check chunking first (split on document structure with slight overlap rather than fixed character counts), add query rewriting so vague user phrasing maps to useful searches, and add a re-ranking step that over-fetches candidates and selects the best few. Use hybrid (vector + keyword) search so exact terms like error codes and product names match. A mediocre model with excellent retrieved context reliably beats a frontier model fed poor context.
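For the hybrid-search step, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). This is a technique I am naming as an example, not one the text above prescribes. The sketch assumes each retriever returns an ordered list of chunk IDs; `k=60` is the conventional default and worth tuning.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list, k: int = 60, top_n: int = 5) -> list:
    """Merge ranked ID lists; items ranked highly in several lists score best."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, source_id in enumerate(ranking, start=1):
            scores[source_id] += 1.0 / (k + rank)
    # Sort by score descending, breaking ties by ID for a stable order.
    return sorted(scores, key=lambda source_id: (-scores[source_id], source_id))[:top_n]


vector_hits = ["a", "b", "c"]   # from semantic search
keyword_hits = ["c", "a", "d"]  # from BM25 / keyword search
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF uses only ranks, so you do not have to reconcile cosine scores with BM25 scores. The merged list is a natural input to a re-ranker.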

How do I control the cost of an LLM feature in production?

Set per-user and per-tenant rate limits and budget caps so usage — including adversarial usage — can't run unbounded. Add caching, including semantic caching for similar queries, which commonly cuts costs 30-50%. Route by difficulty so a small cheap model handles easy requests and the expensive model is reserved for hard ones. Track cost per request in your observability so a single feature can't silently multiply your bill.
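A semantic cache can be sketched as "embed the query, reuse a stored answer if a previous query is close enough." The `embed` callable, the 0.95 threshold, and the linear scan are all assumptions for illustration. In practice the lookup would go through your vector index, and entries should be scoped per tenant so that one customer's answer is never served to another.

```python
import math


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold: float = 0.95):
        self.embed = embed          # callable: text -> list of floats (assumed)
        self.threshold = threshold  # similarity needed to count as a hit
        self.entries = []           # (embedding, answer) pairs

    def get(self, query: str):
        """Return a cached answer for a sufficiently similar query, else None."""
        vector = self.embed(query)
        best_answer, best_score = None, 0.0
        for stored_vector, answer in self.entries:
            score = cosine(vector, stored_vector)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

Set the threshold conservatively at first. A false hit returns a confident answer to a question the user did not ask, which is worse than paying for the call.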

How do I prevent prompt injection in an LLM feature?

Treat the model and its inputs as untrusted. Validate and constrain user input, and critically, treat retrieved content as a potential injection vector since it can contain instructions. Screen model output before acting on it, and never let raw model output trigger a privileged operation without validation. If the model can invoke tools, scope those tools tightly — their permissions define the blast radius of a successful injection, so treat them like a service account with least privilege.

What's the minimum I need before launching an LLM feature?

At minimum: an eval set to detect regressions, input and output guardrails, a cost ceiling with rate limits, a latency budget with streaming and timeouts, a fallback for provider outages, full-chain observability so you can replay any request, and an explicit decision about where the AI acts versus only suggests. Most pre-launch features have only a happy path and maybe rate limiting — each missing item is a likely production incident.
