Evals for LLM Features: Building the Regression Net for a Non-Deterministic Dependency

You can't ship a reliable LLM feature on vibes. Evals are the regression net for a dependency that's non-deterministic, drifts when the provider updates the model, and fails silently. How to build one without boiling the ocean: start with 30 real examples, layer three kinds of checks (assertion, LLM-as-judge, human), measure faithfulness, and run it on every prompt, model, and retrieval change.

June 3, 202611 min read

Evals for LLM Features: Building the Regression Net for a Non-Deterministic Dependency

✦

Key Takeaway

You can't ship a reliable LLM feature on vibes, and "we read some of the outputs" is not a quality process — it's hope with extra steps. Evals are the regression net for a dependency that's non-deterministic, that drifts when the provider updates the model, and that fails silently. This is how to build one without boiling the ocean: start with 30 real examples, layer three kinds of checks (assertion-based, LLM-as-judge, human), measure faithfulness and the failures that actually hurt users, and run it on every prompt, model, and retrieval change. The eval set is the single artifact that turns "we changed the prompt and hoped" into "we changed the prompt and the score went up."

Evals for LLM Features: Building the Regression Net for a Non-Deterministic Dependency

The most honest thing a team ever told me about their LLM feature: "We don't actually know if it's getting better or worse. We change the prompt, it feels better, we ship. Last month a model update made it worse for two weeks before a customer complained and we figured it out."

That's not negligence — it's the default state of almost every LLM feature I review. With normal code, you have tests; a regression announces itself in CI. With an LLM, the same input can produce different outputs, the dependency silently changes underneath you when the provider ships a new model version, and wrongness arrives in the same confident tone as correctness. You are flying a plane with no instruments and calling the lack of alarms "smooth flying."

Evals are the instruments. They're the regression net that lets you change a prompt, swap a model, or tweak retrieval and know — quantitatively — whether you made it better or worse. Of everything in the LLM production checklist, this is the practice that most separates teams who improve their feature from teams who change things and pray. Here's how to build one that's actually useful, starting small.

What an eval actually is

An eval is a repeatable test for an LLM system: a set of representative inputs, run through your feature, scored against expectations. That's it. The mental model is a test suite for a non-deterministic component — same purpose (catch regressions, enable confident change), different mechanics (you score quality and faithfulness, not exact equality, because outputs vary).

The output isn't pass/fail like a unit test — it's a score you compare against a baseline. "Faithfulness went from 0.82 to 0.91" is the kind of sentence that ends a debate about whether a prompt change helped.

Start with 30 examples (not a platform)

The number one reason teams don't have evals is they imagine a big infrastructure project and never start. Don't. Start with 30 real examples in a spreadsheet or a JSON file.

Real, not synthetic. Pull actual user queries from logs (or realistic ones if pre-launch). Synthetic happy-path examples test the case that already works.
Cover the distribution, especially the edges. Include the easy cases, the ambiguous ones, the adversarial ones, and — critically — past failures. Every time the feature gets something wrong in production, that example goes in the eval set. Your eval set should be a growing museum of every mistake you've made, so you never make it twice.
Capture expected qualities, not exact answers. For most LLM tasks there's no single right string. Record what a good answer must do: cite a source, stay on topic, not invent facts, be under N words, refuse when it should.

Thirty examples you run on every change beats a thousand you'll build "later." You can grow it; you can't improve from zero.

The three layers of scoring

You score outputs with three techniques, cheapest and most objective first. Use all three — they catch different failures.

Layer 1 — Assertion-based (deterministic, free, run on everything). Code-based checks that don't need a model: Is the output valid JSON? Does it contain the required citation? Is it under the length limit? Does it avoid a forbidden phrase or leaked system prompt? Did it call the right tool? These are cheap, fast, and objective — run them on every example, every time. They catch a surprising share of real failures.

Layer 2 — LLM-as-judge (scalable, good, validate it). Use a model to score outputs against a rubric — relevance, faithfulness, helpfulness, tone. It's how you scale beyond what assertions can express, and it's surprisingly effective. Two cautions: give the judge a specific rubric (vague "rate 1-10" is noisy; "score 0 if any claim isn't supported by the provided context" is sharp), and periodically validate the judge against human ratings so you trust its scores. A judge you haven't checked is just another unvalidated model in your loop.

Layer 3 — Human review (gold standard, sampled). Humans catch what code and judges miss — subtle tone problems, domain errors, "technically correct but unhelpful." You can't do this on every example every time, so sample: a handful per release, plus anything the other layers flag as borderline. Human review is also how you calibrate the judge.

Measure the things that actually matter

Pick metrics tied to how your feature fails users, not generic vibes. For a RAG/retrieval feature the high-value ones are:

Faithfulness / groundedness — does every claim trace to the retrieved context, or did it hallucinate? Usually the single most important metric for a knowledge feature.
Answer relevance — does it actually address the question asked?
Retrieval quality — did the right context get fetched? (If retrieval failed, generation never had a chance — measure it separately so you know which stage to fix. See RAG in production.)
Refusal correctness — does it decline when it should (no context, out of scope) and answer when it should?
Safety/format violations — PII leakage, toxicity, invalid structure.

The discipline that makes this pay off: separate retrieval failures from generation failures. "The answer was wrong" is not actionable; "the right document wasn't retrieved" vs "the right document was retrieved but the model ignored it" point at completely different fixes.

Run it on every change, like CI

An eval you run once is a science project. An eval you run on every change is a regression net. Wire it into your workflow so it runs on:

Every prompt change — the most common silent regressor.
Every model swap or version bump — including provider-forced updates. This is how you catch "the model update made it worse" before a customer does.
Every retrieval/RAG change — chunking, re-ranking, embeddings.

The loop that compounds: production failures feed new eval examples; the growing eval set catches more regressions; the feature gets monotonically more reliable. That feedback loop is the whole game.

What to do Monday morning

Create an eval set with 30 real examples today. A JSON file or spreadsheet of actual inputs plus the qualities a good answer must have. Don't build a platform — build the list.
Add Layer-1 assertions first. Valid format, required citation, length, forbidden content. These are free and catch real failures immediately. Run them and get a baseline number.
Add an LLM-as-judge with a sharp rubric for faithfulness and relevance, then spot-check its scores against your own judgment so you trust it.
Wire it to run on every prompt and model change, and make a rule: every production failure becomes a new eval example. That's how the net gets stronger over time.

Key takeaways

Evals are the regression net for a non-deterministic, silently-drifting dependency. Normal tests don't fit, but you still need to know if a change made things better or worse. "We read some outputs" is hope, not a process.
Start with 30 real examples, not a platform. Use actual logs, cover the edges and especially past failures, and record expected qualities rather than exact answers. A small eval you run on every change beats a big one you never build.
Three scoring layers, cheapest first: assertion-based (deterministic, run on everything), LLM-as-judge (scalable, needs a sharp rubric and periodic human validation), and sampled human review (the gold standard that calibrates the rest).
Measure what actually fails users. Faithfulness/groundedness, answer relevance, retrieval quality, refusal correctness, and safety/format — and separate retrieval failures from generation failures, because they need different fixes.
Run it like CI and feed it production failures. Every prompt, model, and retrieval change runs the eval; every production miss becomes a new example. That loop makes the feature monotonically more reliable.

Your next step

Open a file and write down 30 real inputs your LLM feature should handle — pulled from logs if you have them, including every failure you can remember. Run them through the feature once and record how many pass a few basic assertions. That number is your baseline, and the moment you have it, you've gone from "we hope it's good" to "it's at 0.7 and our job is to raise it." Everything good about operating an LLM feature flows from having that number and watching it climb.

Frequently asked questions

What are LLM evals?

LLM evals are repeatable tests for an LLM-powered feature: a set of representative inputs run through the system and scored against expectations. They function as a regression net for a dependency that is non-deterministic (the same input can yield different outputs), that drifts when the provider updates the model, and that fails silently in a confident tone. Unlike unit tests, evals produce a quality score you compare against a baseline rather than a strict pass/fail, letting you tell whether a prompt, model, or retrieval change made the feature better or worse.

How do I start building an eval set?

Start with about 30 real examples in a spreadsheet or JSON file — don't build a platform first. Use actual user queries from logs (or realistic ones if pre-launch), cover the full distribution including ambiguous, adversarial, and previously-failed cases, and record the qualities a good answer must have (cites a source, stays on topic, doesn't invent facts, correct length) rather than a single exact answer. Then make it a rule that every production failure becomes a new eval example so the set grows into a museum of mistakes you never repeat.

What is LLM-as-judge and can I trust it?

LLM-as-judge uses a model to score outputs against a rubric — for example relevance, faithfulness, or tone — which lets you evaluate qualities that simple code assertions can't express, at scale. It's surprisingly effective but must be used carefully: give the judge a specific, sharp rubric (e.g., "score 0 if any claim isn't supported by the provided context") rather than a vague 1-10 scale, and periodically validate its scores against human ratings so you can trust them. Treat it as one layer alongside deterministic assertions and sampled human review, not the only check.

What metrics matter for evaluating an LLM feature?

Choose metrics tied to how your feature fails users. For knowledge/RAG features the high-value ones are faithfulness/groundedness (does every claim trace to the retrieved context, or did it hallucinate), answer relevance (does it address the question), retrieval quality (was the right context fetched), refusal correctness (does it decline when it should), and safety/format violations (PII, toxicity, invalid structure). Critically, separate retrieval failures from generation failures, because "the wrong document was fetched" and "the right document was ignored" require completely different fixes.

How often should I run evals?

Run them on every change that can affect output quality: every prompt change (the most common silent regressor), every model swap or provider version bump (to catch a model update degrading quality before a customer does), and every retrieval or RAG change such as chunking, re-ranking, or embeddings. Wire the eval set into your workflow like CI so a regression blocks the change, and continuously feed production failures back into the dataset so the net gets stronger over time.

Runnable reference: The eval framework described here — accuracy, groundedness, faithfulness, latency, and cost checks — is implemented as a CI-native tool in packages/evals inside the ai-native-app-blueprint repo, and as a standalone installable package at llm-eval-runner. Run these evals in your CI today: pip install llm-eval-runner.

#software-architecture#llm#ai#evals#testing#ai-quality#llm-as-judge#ai-engineering#2026

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

Evals for LLM Features: Building the Regression Net for a Non-Deterministic Dependency

Evals for LLM Features: Building the Regression Net for a Non-Deterministic Dependency

What an eval actually is

Start with 30 examples (not a platform)

The three layers of scoring

Measure the things that actually matter

Run it on every change, like CI

What to do Monday morning

Key takeaways

Your next step

Frequently asked questions

What are LLM evals?

How do I start building an eval set?

What is LLM-as-judge and can I trust it?

What metrics matter for evaluating an LLM feature?

How often should I run evals?

Continue Reading

RAG in Production: Chunking, Re-ranking, and Hybrid Search (The Deep Dive)

Retrieval Failure vs Generation Failure: How to Diagnose Which Layer Is Killing Your RAG System

LLM Architecture in Production: RAG, Vector Databases, and the 7-Point System-Design Checklist