Shipping an AI Feature Right: A 7-Day Production Walkthrough

Most teams ship an LLM call in an afternoon and spend the next month firefighting. This walkthrough shows the correct order — spec, architecture decision, eval criteria, implementation, CI gate, production observability — using a real cloneable repo (spec-to-ship-workflow) that runs in 10 minutes with zero API keys. Covers the retrieval-confidence floor that prevents most RAG hallucinations, two-mode providers for CI reproducibility, golden test cases before implementation, and the eval drift alert that catches regressions no other metric sees.

June 5, 2026 · 15 min read

Key Takeaway

Most teams ship an LLM call in an afternoon and spend the next month firefighting. This walkthrough shows the correct order — spec, then architecture decision, then eval criteria, then implementation, then CI gate, then production observability — using a real cloneable repo that runs in 10 minutes with zero API keys.

You can clone spec-to-ship-workflow and run every step described in this post before you finish reading it.

The order most teams get wrong

When a team decides to add an AI feature, here's what usually happens:

1. Engineer opens a file
2. Engineer pastes in an LLM API call
3. Engineer ships it to production
4. Something goes wrong
5. Engineer adds a try/catch
6. Something else goes wrong

The problem isn't using LLMs. The problem is the order. The code got written before anyone answered three questions:

What does "working" mean? (There's no spec, so there's no way to know.)
How will we know if it regresses? (There are no evals, so the answer is "a customer tells us.")
What happens when the model gets it wrong? (There are no guardrails, so the answer is "it answers confidently with something invented.")

The 7-day arc is the correct order. Each day adds something the previous day makes possible.

The feature: a support QA assistant

The worked example is concrete enough to be real: a question-answering feature that retrieves relevant help content and returns a grounded, cited answer. Something a product team at a scale-up would actually build.

The knowledge base is a small set of help articles. The input is a user question. The output is an answer with sources, latency, and cost.

Simple enough to understand in 5 minutes. Complex enough to go wrong in 5 ways.

Day 0: Write the spec before any code

The first commit in the repo is a spec. No code. Just a document.

# AI Feature Spec — Support QA

## Problem statement
Support agents and customers waste time hunting through help articles for answers
that already exist. We want a question-answering feature that retrieves the
relevant help content and returns a grounded, cited answer — never an invented one.

## Input / output contract

Input:  { question: string }
Output: { answer, sources, latencyMs, costUsd }

## Acceptance criteria

| # | Criterion              | Measurement              | Threshold |
|---|------------------------|--------------------------|-----------|
| 1 | Answers correctly      | Accuracy (F1)            | ≥ 0.40    |
| 2 | Grounded in context    | Groundedness             | ≥ 0.60    |
| 3 | Fast enough inline     | Latency p95              | ≤ 2000ms  |
| 4 | Affordable at scale    | Cost per query           | ≤ $0.005  |
| 5 | Refuses out-of-scope Q | mustContain "don't have" | enforced  |

The spec was generated with npx @ruchit07/ai-spec init "support QA", then filled in.

Why this matters: The spec is the contract between the feature and the eval suite. Without it, you can't define what "passing" means. Without a definition of passing, you can't have a CI gate. Without a CI gate, the only way to detect a regression is a customer complaint.

The spec also records known failure modes for v1: multi-turn questions, multi-article spans, non-English. Not because these are unimportant, but because being explicit about what v1 doesn't cover is how you avoid scope creep and half-built features.

The right tool: The ai-spec CLI scaffolds the structure in seconds:

npx @ruchit07/ai-spec init "support QA feature for help center"

It generates spec.md, eval-criteria.json, a seed test-cases.json, and an ADR starter. Day 0 takes an hour, not a day.

Day 1: Decide the architecture and record it

Day 1's commit is an ADR. Still no feature code.

# ADR-0001: In-Memory TF-IDF for v1, pgvector Later

## Decision
Use an in-memory TF-IDF retriever for v1, behind a stable retrieve(query, topK)
interface. Pair it with a two-mode provider: a deterministic offline mock by
default, real OpenAI when AI_MODE=openai.

## When to revisit
Move to pgvector + hybrid-search when:
- the KB exceeds a few hundred articles, OR
- users ask paraphrased/semantic questions TF-IDF can't match, OR
- you need the index to persist across processes.
The spec and the eval suite don't change when you do — that's the payoff of
the stable interface.

Two things to notice:

The decision is explicit. Not "we used TF-IDF" buried in a git blame. An ADR explains why, what the trade-offs were, and what would trigger changing it. Six months later, when someone asks "why aren't we using embeddings?" the answer is in the repo, not in the head of an engineer who has since left.

The interface abstracts the implementation. retrieve(query, topK) is stable regardless of whether what's behind it is TF-IDF, pgvector, or a hybrid pipeline. When you graduate to the ai-native-app-blueprint RAG pipeline, you change one file. The spec doesn't change. The evals don't change.

This is what "paying off later" actually means. Not "we can refactor it someday" — a concrete migration path with a clear trigger condition.
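As a rough illustration of the stable interface from ADR-0001, the sketch below puts a small in-memory TF-IDF scorer behind retrieve(query, topK). The Doc and Scored shapes, the tokenizer, the smoothed IDF formula, and the sample articles are all assumptions made for this example, not the repo's actual code.

```typescript
// Hypothetical shapes for illustration; the repo's real types may differ.
interface Doc { id: string; title: string; content: string; }
interface Scored { doc: Doc; score: number; }

// The stable seam: callers depend on this interface and nothing else.
interface Retriever {
  retrieve(query: string, topK: number): Scored[];
}

// Lowercase and split into alphanumeric tokens.
const tokenize = (text: string): string[] =>
  text.toLowerCase().match(/[a-z0-9]+/g) ?? [];

class TfIdfRetriever implements Retriever {
  private readonly docTokens: string[][];
  private readonly idf = new Map<string, number>();

  constructor(private readonly docs: Doc[]) {
    this.docTokens = docs.map((d) => tokenize(`${d.title} ${d.content}`));

    // Document frequency: in how many documents does each term appear?
    const df = new Map<string, number>();
    for (const tokens of this.docTokens) {
      Array.from(new Set(tokens)).forEach((term) => {
        df.set(term, (df.get(term) ?? 0) + 1);
      });
    }
    // Smoothed IDF (an assumed variant): rarer terms get larger weights.
    df.forEach((count, term) => {
      this.idf.set(term, Math.log((1 + docs.length) / (1 + count)) + 1);
    });
  }

  retrieve(query: string, topK: number): Scored[] {
    const queryTerms = Array.from(new Set(tokenize(query)));
    return this.docs
      .map((doc, i) => {
        const tokens = this.docTokens[i];
        let score = 0;
        for (const term of queryTerms) {
          // Term frequency normalized by document length.
          const tf = tokens.filter((t) => t === term).length / Math.max(tokens.length, 1);
          score += tf * (this.idf.get(term) ?? 0);
        }
        return { doc, score };
      })
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

// Usage with two made-up articles.
const kb: Doc[] = [
  { id: "a", title: "Reset your password", content: "Click Forgot Password on the login page." },
  { id: "b", title: "Invoices", content: "Find your invoices under Billing." },
];
const retriever: Retriever = new TfIdfRetriever(kb);
const topHits = retriever.retrieve("reset password", 1);
```

Because callers only see Retriever, a pgvector-backed class exposing the same method could replace TfIdfRetriever without any caller changing.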

Day 2: Write the evals before the feature

Day 2's commit is a test suite. Still no feature implementation.

// eval-criteria.json
{
  "suite": "support-qa",
  "thresholds": {
    "accuracy": 0.4,
    "groundedness": 0.6,
    "latencyP95Ms": 2000,
    "costPerQueryUsd": 0.005
  },
  "failOnRegression": true
}

// test-cases.json (7 cases)
[
  { "id": "tc-001-password-reset",
    "input": { "question": "How do I reset my password?" },
    "expected": { "mustContain": ["Forgot Password"] } },
  { "id": "tc-002-invoices",
    "input": { "question": "Where can I find my invoices?" },
    "expected": { "mustContain": ["Billing"] } },
  // ... tc-003 through tc-006: in-scope questions ...
  { "id": "tc-007-out-of-scope",
    "input": { "question": "What is the CEO's home address?" },
    "expected": { "mustContain": ["don't have"], "mustNotContain": ["Street", "Avenue"] } }
]

tc-007 is the most important test case in the suite. More on that in a moment.

Why evals before implementation? The same reason you write unit tests before the code: it forces clarity. When you write the test cases, you answer the question "what should the feature actually do?" in a form precise enough to automate. You discover ambiguities in the spec before you're committed to an implementation. And the test cases become the contract the implementation must satisfy — rather than the implementation becoming the definition of correct.

Teams that skip this step ship AI features that pass the demo but fail on real-world edge cases, with no automated way to catch the regression when a model update changes the behavior.
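To make the mustContain / mustNotContain assertions concrete, here is a minimal sketch of how a single case could be checked. It assumes case-insensitive substring matching, which is a guess for illustration; the grader used in the repo may match differently.

```typescript
// Mirrors the "expected" object in test-cases.json.
interface Expected {
  mustContain?: string[];
  mustNotContain?: string[];
}

interface CaseResult {
  passed: boolean;
  failures: string[];
}

// Assumption: case-insensitive substring checks. The real grader may differ.
function checkAnswer(answer: string, expected: Expected): CaseResult {
  const haystack = answer.toLowerCase();
  const failures: string[] = [];

  for (const phrase of expected.mustContain ?? []) {
    if (!haystack.includes(phrase.toLowerCase())) {
      failures.push(`missing required phrase: "${phrase}"`);
    }
  }
  for (const phrase of expected.mustNotContain ?? []) {
    if (haystack.includes(phrase.toLowerCase())) {
      failures.push(`contains forbidden phrase: "${phrase}"`);
    }
  }
  return { passed: failures.length === 0, failures };
}

// tc-007 from the suite above, checked against a refusal and against an invented answer.
const tc007: Expected = { mustContain: ["don't have"], mustNotContain: ["Street", "Avenue"] };
const refusalCheck = checkAnswer("I don't have a help article that covers this question.", tc007);
const inventedCheck = checkAnswer("The CEO lives on Main Street.", tc007);
```

Returning the list of failures, rather than a bare boolean, gives the report something specific to show when a case regresses.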

Days 3–5: Implement against the spec

Now, and only now, do you write the feature. The spec is the contract. The eval suite is the acceptance test. The implementation's job is to satisfy both.

The core feature (src/feature.ts) is about 50 lines:

const MIN_RELEVANCE = 0.05;

export async function ask(question: string): Promise<AskResult> {
  const start = Date.now();

  const retrieved = getRetriever()
    .retrieve(question, 3)
    .filter((r) => r.score >= MIN_RELEVANCE);

  // Retrieval gate: nothing relevant → deterministic refusal, NO LLM call.
  if (retrieved.length === 0) {
    const refusal = "I don't have a help article that covers this question.";
    return {
      answer: refusal,
      context: refusal,
      sources: [],
      latencyMs: Date.now() - start,
      costUsd: 0,
    };
  }

  const context = retrieved
    .map((r, i) => `[${i + 1}] ${r.doc.title}: ${r.doc.content}`)
    .join('\n\n');

  const { answer, costUsd } = await generate(question, context);

  return { answer, context, sources: ..., latencyMs: Date.now() - start, costUsd };
}

The retrieval-confidence floor: the most important guardrail

MIN_RELEVANCE = 0.05 is the single most important line in the implementation. Here's why.

Without it, tc-007 ("What is the CEO's home address?") would retrieve the most loosely related article — maybe one mentioning the CEO in a company overview. The retrieval score would be weak, but non-zero. The context passed to the model would be irrelevant. The model, given some context and a direct question, would try to answer.

You get a confident-sounding answer about a CEO's "location" derived from an article about the company's founding. Wrong, but fluent.

With the retrieval floor: if no article scores at least 0.05, nothing survives the filter and no context is built. The feature returns a deterministic refusal without calling the LLM: zero cost, zero hallucination. tc-007 passes because the answer contains "don't have".

This is the pattern that prevents most RAG hallucinations. Not a better model. Not a longer prompt. A gate before the LLM call that says: if retrieval failed, don't generate.

The two-mode provider

The feature runs in two modes:

Default (offline): a deterministic mock generator. Outputs predictable answers from the context, no API key needed.
OpenAI mode: AI_MODE=openai OPENAI_API_KEY=sk-... npm run ask "..."

The retriever and the eval harness don't change between modes. Only the generator does. This lets CI run without secrets, while production uses the real model.
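One way the mode switch could look is sketched below. The Generator type, the mock's behavior (echoing the first context passage), and the injected realGenerate parameter are assumptions for illustration. The real model call is passed in, so the sketch does not assume any SDK's API.

```typescript
interface Generation { answer: string; costUsd: number; }
type Generator = (question: string, context: string) => Promise<Generation>;

// Offline mode: deterministic, no network, zero cost.
// Echoing the first context passage is an illustrative choice, not the repo's actual mock.
const mockGenerate: Generator = async (_question, context) => {
  const firstPassage = context.split("\n\n")[0] ?? "";
  // Strip the leading "[1] " citation marker added when the context was built.
  return { answer: firstPassage.replace(/^\[\d+\]\s*/, ""), costUsd: 0 };
};

// Mode selection. The environment is passed in explicitly so the function stays pure and testable.
function selectGenerator(
  env: Record<string, string | undefined>,
  realGenerate: Generator,
): Generator {
  if (env.AI_MODE === "openai") {
    if (!env.OPENAI_API_KEY) {
      throw new Error("AI_MODE=openai requires OPENAI_API_KEY");
    }
    return realGenerate;
  }
  return mockGenerate; // default: offline and deterministic
}
```

In an application, the first argument would typically be the process environment; tests can pass a plain object instead.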

Day 6: Wire the CI gate

Day 6's commit adds .github/workflows/ci.yml. The pipeline is three steps:

test → evals:generate → evals:grade

npm test — unit tests on the retriever, provider, and feature logic.
npm run evals:generate — run the feature against all 7 golden test cases, write answers to evals/results.json.
npm run evals:grade — run llm-eval-runner against results.json + eval-criteria.json. Non-zero exit if any metric is below threshold.

# Local — same as CI
npm run evals:generate
pip install llm-eval-runner
npm run evals:grade

Expected output:

Result: ✅ PASSED
Cases: 7/7 passed (100%)
Metrics:
  ✓ accuracy         ████████████████████  100.0% (≥ 40%)
  ✓ groundedness     ████████████████████  100.0% (≥ 60%)
  ✓ latency          ████████████████████  100.0% within p95 budget

What this buys you: a prompt change that causes a regression fails CI before anyone reviews the PR. A model provider update that degrades accuracy fails the same gate whenever the suite runs against the real model (AI_MODE=openai); the deterministic mock cannot surface it. The golden set is the net. Every production failure that gets added to test-cases.json makes the net stronger.

The llm-eval-runner is a standalone installable package (pip install llm-eval-runner) that any repo can use. The grader is the same one running here — composable by design.
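The gating logic itself is simple enough to sketch. The following is not llm-eval-runner's implementation, only a hypothetical illustration of comparing measured values against the thresholds in eval-criteria.json. The Measured shape, the nearest-rank percentile, and the use of mean cost per query are assumptions.

```typescript
// Same keys as the "thresholds" object in eval-criteria.json.
interface Thresholds {
  accuracy: number;
  groundedness: number;
  latencyP95Ms: number;
  costPerQueryUsd: number;
}

// Hypothetical shape for what a grading run measured.
interface Measured {
  accuracy: number;
  groundedness: number;
  latenciesMs: number[];
  costsUsd: number[];
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Returns one message per violated threshold; an empty array means the gate passes.
function gate(measured: Measured, t: Thresholds): string[] {
  const failures: string[] = [];
  const p95 = percentile(measured.latenciesMs, 95);
  const meanCost =
    measured.costsUsd.reduce((sum, c) => sum + c, 0) / Math.max(measured.costsUsd.length, 1);

  if (measured.accuracy < t.accuracy) failures.push(`accuracy ${measured.accuracy} < ${t.accuracy}`);
  if (measured.groundedness < t.groundedness) failures.push(`groundedness ${measured.groundedness} < ${t.groundedness}`);
  if (p95 > t.latencyP95Ms) failures.push(`latency p95 ${p95}ms > ${t.latencyP95Ms}ms`);
  if (meanCost > t.costPerQueryUsd) failures.push(`cost per query ${meanCost} > ${t.costPerQueryUsd}`);
  return failures;
}

// Thresholds copied from eval-criteria.json above.
const thresholds: Thresholds = { accuracy: 0.4, groundedness: 0.6, latencyP95Ms: 2000, costPerQueryUsd: 0.005 };
```

A CI script built on this would exit non-zero whenever gate returns a non-empty list, which is what turns a quality drop into a failed build.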

Day 7: Production observability and the eval drift alert

The feature emits three numbers per query: latency, cost, and retrieval scores. Day 7 wires them to telemetry.

The four panels that matter in production:

| Panel        | What to watch                             | Alert when                             |
|--------------|-------------------------------------------|----------------------------------------|
| Cost per day | `sum(cost_usd)`                           | Daily cost > budget × 1.2              |
| Latency p95  | `p95(latency_ms)`                         | p95 > 2000ms                           |
| Refusal rate | `answers containing "don't have" / total` | Spike → KB gap or retrieval regression |
| Eval drift   | Weekly graded score                       | Any metric drops below threshold       |

The eval drift alert is the most important one and the one most teams skip. LLM features regress without code changes — a model provider update, a KB edit that changes an article's wording, a subtle prompt drift. You won't catch it from latency or error rate metrics, because the feature is still "working." The eval suite catches it.

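Schedule it weekly. Run it against the real model (AI_MODE=openai); the offline mock is deterministic, so it cannot reveal a provider-side change:

AI_MODE=openai OPENAI_API_KEY=sk-... npm run evals:generate && npm run evals:grade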

If it fails: the report tells you which metric and which cases. That's the starting point for the investigation, not the end of it.
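As a sketch of how the cost and refusal-rate panels could be derived from per-query records, consider the following. The QueryRecord shape, the baseline refusal rate, and the spike factor are hypothetical; the post does not define what counts as a spike, so that rule is an assumption.

```typescript
// Hypothetical per-query telemetry record.
interface QueryRecord {
  answer: string;
  latencyMs: number;
  costUsd: number;
}

// Share of answers that are refusals, using the same "don't have" marker the evals check for.
function refusalRate(records: QueryRecord[]): number {
  if (records.length === 0) return 0;
  const refusals = records.filter((r) => r.answer.includes("don't have")).length;
  return refusals / records.length;
}

// Cost panel rule from the table: alert when daily cost exceeds budget × 1.2.
function costExceeded(records: QueryRecord[], dailyBudgetUsd: number): boolean {
  const total = records.reduce((sum, r) => sum + r.costUsd, 0);
  return total > dailyBudgetUsd * 1.2;
}

// Assumed spike rule: today's rate is more than spikeFactor times the baseline rate.
function refusalSpike(records: QueryRecord[], baselineRate: number, spikeFactor = 2): boolean {
  return refusalRate(records) > baselineRate * spikeFactor;
}

// One day's made-up records.
const day: QueryRecord[] = [
  { answer: "Click Forgot Password on the login page.", latencyMs: 120, costUsd: 0.5 },
  { answer: "I don't have a help article that covers this question.", latencyMs: 4, costUsd: 0 },
  { answer: "Invoices are under Billing.", latencyMs: 150, costUsd: 0.5 },
  { answer: "Open Settings, then Team.", latencyMs: 140, costUsd: 0.5 },
];
```

A rising refusal rate with flat traffic points at retrieval or KB coverage rather than the model, which is why it earns its own panel.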

The complete toolchain: how it composes

The repo is designed to show how the tools compose:

Day 0 / Day 2 — ai-spec CLI
  ↓  generates spec.md, eval-criteria.json, test-cases.json

Days 3–5 — the feature (this repo)
  ↓  implements against the spec's contract

Day 6 — llm-eval-runner
  ↓  grades the feature, gates CI, fails on regression

Scale-up — ai-native-app-blueprint
  ↓  swap TfIdfRetriever → packages/rag (pgvector + hybrid search)
     spec and evals unchanged

Each tool is independently installable and useful standalone. The point of the spec-to-ship-workflow repo is to show them working together end-to-end, with a real feature, real eval scores, and a CI pipeline that proves it.

You can clone it, run it in 10 minutes, read the git history commit by commit, and see the exact decisions — not just the result.

What to apply to your next AI feature

The discipline in this repo isn't specific to a support QA feature. It applies to any LLM feature: content generation, code review bots, classification, semantic search, summarization.

Before writing code:

1. Write the spec. What does "working" mean, precisely? What are the acceptance criteria you'd put in a CI gate?
2. Write the ADR. What retrieval approach? What provider? What's the stable interface so you can swap the implementation later?
3. Write the golden test cases. Cover the happy path, the edge cases, and at least one out-of-scope query.

During implementation:

4. Implement the retrieval-confidence floor. If retrieval fails, refuse before calling the LLM. This prevents most RAG hallucinations.
5. Build a two-mode provider (deterministic mock + real model) so CI can run without secrets.

Before shipping:

6. Wire the eval gate to CI. A regression should fail the build, not surface in a customer report.

In production:

7. Track latency, cost, and refusal rate per query. Schedule weekly eval runs to catch drift.

The order matters. You can't have a CI gate without evals. You can't write useful evals without a spec. Doing days 3–5 first and hoping to add the rest later means the rest never gets added — there's always something more urgent.

Run it now

git clone https://github.com/ruchit07/spec-to-ship-workflow
cd spec-to-ship-workflow
npm install

# Ask a question (offline, deterministic)
npm run ask "How do I reset my password?"

# Run the unit tests
npm test

# Grade the full eval suite
npm run evals:generate
pip install llm-eval-runner
npm run evals:grade

# Walk the git history to read the decisions in order
git log --oneline

The toolchain:

ai-spec — scaffolds the spec and eval structure (Day 0/2)
llm-eval-runner — grades answers, gates CI (Day 6)
ai-native-app-blueprint — production RAG pipeline for when v1 needs to scale

Key takeaways

The order is the lesson. Spec → ADR → evals → implementation → CI → observability. Not the other way around.
The retrieval-confidence floor prevents most RAG hallucinations — not a better model, a gate before the LLM call.
Writing evals before implementation forces clarity on what "working" means before you're committed to an approach.
The stable interface (retrieve(query, topK)) is what lets you swap TF-IDF for pgvector without changing the spec or the evals.
The eval drift alert catches regressions that latency and error metrics miss — LLM features degrade without code changes.
The two-mode provider (mock + real) lets CI run without API keys, reproducibly.
Every production failure should become a new test case. The eval suite is a museum of mistakes you never repeat.

FAQ

What makes an AI feature spec different from a regular feature spec?

An AI feature spec must include measurable quality thresholds because AI output is probabilistic and can't be validated by unit tests alone. A regular spec might say "returns the user's invoices." An AI spec says "answers invoice-related questions with groundedness ≥ 0.60 and latency p95 ≤ 2000ms, and refuses when the answer isn't in the knowledge base." The thresholds become the CI gate. Without them, you can't automate regression detection.

What is the retrieval-confidence floor and why does it matter?

The retrieval-confidence floor is a minimum relevance score below which the feature refuses to answer, without calling the LLM. If the best-matching document scores below the threshold, there's no relevant context to ground an answer — so instead of passing weak context to the model (which will try to answer anyway and likely hallucinate), the feature returns a deterministic refusal. It saves LLM cost on every out-of-scope query and eliminates the most common failure mode: confident-but-grounded-in-nothing answers.

When should I replace TF-IDF with vector search?

When your knowledge base exceeds a few hundred articles, when users ask paraphrased or semantic questions that don't match keywords, or when you need the index to persist across processes. TF-IDF is reliable for small, keyword-rich corpora and requires no infrastructure. Embeddings + pgvector handle semantic similarity and scale. The key is designing a stable retrieve(query, topK) interface from the start, so the migration changes one file and leaves the spec, evals, and CI gate unchanged. See ADR-0001 for the specific trigger conditions.

How do I write golden test cases that are actually useful?

Cover four categories: typical happy-path queries (the most common things users actually ask), edge cases (ambiguous phrasing, partial matches), known failure modes (things that broke in production before), and out-of-scope queries (things the feature should refuse). Seven to fifteen cases is enough to start — quality matters more than quantity. Use mustContain and mustNotContain assertions rather than exact match, since LLM output varies. Add a new case every time something fails in production.

How often should I run evals?

Run them on every PR that touches the prompt, the retrieval pipeline, the KB content, or the provider. Run them on a schedule in production, weekly at minimum, to catch drift from model updates you didn't control. The scheduled run is the most important one: it's the only signal you'll get when a provider silently updates a model and your feature degrades without any code change on your end.

Why does the CI pipeline use a mock LLM instead of a real one?

Reproducibility and cost. A mock LLM returns deterministic answers based on the context, so CI results are consistent across runs and don't depend on API availability or pricing. The eval suite still runs fully — all 7 cases, all metrics — because the grader evaluates the answers against the criteria regardless of how they were generated. For final validation before a major release, run the eval suite in AI_MODE=openai to verify real-model quality. For every-PR CI, the mock is the right choice.

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.
