SpecLoom: Deterministic Context for Coding Agents
Most agent SDLC setups use the LLM as the runtime for everything—including deciding which files to read—which is the biggest source of token waste and non-determinism. SpecLoom flips this: write your spec as typed blocks with IDs and dependencies, and a deterministic compiler emits a minimal, hash-stamped bundle for one task. A real engineer bundle compiles to ~370 tokens instead of 20–60k, the same task always produces a byte-identical bundle, and @spec:ID#hash anchors turn spec/code drift into a CI failure. Covers the .loom format, the Deterministic Context Compiler, tiered budget degradation, the drift gate, engine-enforced persona gates, and a 60-second loop to try it.
Ruchit Suthar
15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

Most agent SDLC setups use the LLM as the runtime for *everything* — including deciding which files to read and assembling context — which is the single biggest source of both token waste and non-determinism. The fix is to compile, not read: write your spec as small typed blocks with IDs and dependencies, and let a deterministic compiler emit a minimal, hash-stamped bundle for one task. I built **SpecLoom** to do exactly this. A real engineer bundle compiles to ~370 tokens instead of the 20–60k of "read the docs folder," the same task always produces a byte-identical bundle, and `@spec:ID#hash` anchors turn spec/code drift into a build failure instead of an archaeology project. This post is the design — the eight problems it solves, the `.loom` format, the compiler, the drift gate — and a 60-second loop to try it.
June 2026. I'm reviewing an agent-built PR on a service that's been under AI-assisted development for about two months. The diff looks fine. Then I ask the obvious question — "does this still match the spec?" — and realize nobody can answer it mechanically. The spec is a 4,000-word markdown doc. The code is the code. The only way to know whether they agree is to read both and hold them in my head, which is precisely the work the spec was supposed to save.
Worse, I pull up the agent's context logs and watch it load the entire PRD, the architecture doc, and a 2,000-token persona prompt — on every single turn — to implement a change that touches one function. Tens of thousands of tokens per call, almost none of it relevant. And because the agent "decides what to read," two runs of the same task pulled different context and produced different code. The output wasn't reproducible. Debugging it was folklore.
This is the state of spec-driven development with agents in 2026, and it has a single root cause.
The root cause: the LLM is the runtime for everything
The two frameworks that dominate this space each taught us something real, and each shares one structural flaw.
GitHub Spec Kit popularized the clean staged pipeline — constitution → specify → plan → tasks → implement. The flow is good. But its artifacts are free-form prose, so agents must read and interpret whole documents on every step, there's no machine-checkable link between a spec sentence and the code that implements it, and nothing stops the two from drifting after sprint two.
BMAD-METHOD popularized rich agent personas — Analyst, PM, Architect, Dev, QA — that simulate a real team. Role discipline is good. But the personas are large, always-loaded prompts, every invocation re-ships big documents into context, and the "role behavior" is flavor text the model may or may not honor. Nothing enforces the gates.
Both use the LLM as the runtime for work that needs no intelligence at all: deciding which files to read, assembling context, checking whether a stage is complete, tracking what depends on what. That's the waste. That's the non-determinism. You cannot make a model deterministic — but you can make everything around it deterministic.
That's the entire idea behind SpecLoom. I call it the pseudo-engine principle: anything that doesn't require intelligence runs as deterministic code at zero token cost. Parsing, dependency resolution, ordering, budgeting, gate enforcement, drift checking — all engine. The LLM is invoked only for the genuinely generative steps: writing specs, designing, coding, reviewing.
Compile, don't read
The shift in one sentence: stop handing the agent a folder and asking it to figure out what's relevant; compile the exact dependency closure for one task and hand it nothing else.
You write specs as small typed blocks. Each block has an ID, a type, and explicit needs=[...] dependencies:
::: id=REQ-001 type=requirement needs=[GOAL-001,NFR-001] tier=1 owner=pm status=approved
Email+password signup with a verification link.
Acceptance:
- A1: valid email + 8+ char password creates an unverified account
- A2: verification email arrives within 30s; link valid 24h, single-use
- A3: signing up with an existing email returns a clear, non-enumerating error
:::

A .loom file is just markdown; the compiler only reads the :::-fenced blocks and ignores everything else, so you can write as much human prose around them as you like. When you compile a task, the engine walks the needs graph, resolves the transitive closure, and emits a bundle containing only those blocks, plus the constitution (your invariants) and the persona contract for the role doing the work.
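To make the closure step concrete, here is a minimal sketch in plain JavaScript. It assumes the header syntax shown in the example above: space-separated key=value attributes and a bracketed, comma-separated needs list with no spaces. The names parseBlocks and closure are hypothetical and are not SpecLoom's actual API; the real parser likely handles more cases.

```javascript
// Hypothetical sketch: parse :::-fenced blocks and resolve a needs closure.
// parseBlocks and closure are illustrative names, not SpecLoom's real API.

function parseBlocks(text) {
  const blocks = new Map();
  let current = null;
  for (const line of text.split("\n")) {
    if (current === null && line.startsWith("::: ")) {
      // Header line: space-separated key=value attributes.
      const attrs = {};
      for (const pair of line.slice(4).trim().split(/\s+/)) {
        const eq = pair.indexOf("=");
        if (eq > 0) attrs[pair.slice(0, eq)] = pair.slice(eq + 1);
      }
      // needs=[A,B] becomes ["A", "B"]; a missing attribute becomes [].
      const needs = (attrs.needs || "[]").slice(1, -1).split(",").filter(Boolean);
      current = { id: attrs.id, type: attrs.type, needs, body: [] };
    } else if (current !== null && line.trim() === ":::") {
      // Closing fence: store the finished block.
      blocks.set(current.id, { ...current, body: current.body.join("\n") });
      current = null;
    } else if (current !== null) {
      current.body.push(line);
    }
    // Lines outside a fence are human prose and are ignored.
  }
  return blocks;
}

function closure(blocks, rootId) {
  const seen = new Set();
  const stack = [rootId];
  while (stack.length > 0) {
    const id = stack.pop();
    if (seen.has(id)) continue; // also guards against dependency cycles
    const block = blocks.get(id);
    if (!block) throw new Error(`unknown block: ${id}`);
    seen.add(id);
    stack.push(...block.needs);
  }
  return [...seen].sort(); // sorted so the result does not depend on walk order
}
```

Calling closure(parseBlocks(text), "TASK-001") returns only the IDs reachable from the task, so an unrelated requirement in the same file never reaches the bundle.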
Here's the measured difference. A complete engineer bundle for a real task — constitution + engineer contract + task + requirement + NFR + goal, six blocks — compiles to ~370 estimated tokens:
$ npx @ruchit07/specloom compile --task TASK-001 --persona engineer
[specloom] hash=da3f0a7ce907 blocks=6 est_tokens=367/8000 degraded=false

That's against a corpus of any size. The same work under document-reading SDD routinely costs 20–60k input tokens per agent turn. Against this ~370-token bundle that is well over an order of magnitude; for more typical bundles of 1–3k tokens the reduction is roughly 10–30×. Fidelity goes up as well, because the context is exactly the dependency closure and nothing else, so there is no irrelevant architecture doc for the model to get distracted by.
The other half of the savings is the rules file. CLAUDE.md, .cursor/rules, and friends get loaded on every request, so they're the most expensive tokens in your project. The common anti-pattern is stuffing them with project knowledge. SpecLoom's adapters emit a rules file that contains only the protocol (~250 tokens) — compile a bundle, work only from it, anchor your code, run verify — and never project knowledge. Project knowledge lives in blocks that get compiled on demand.
Determinism you can put in a commit message
"The agent decides what to read" is the line that quietly kills reproducibility. Same task, different context, different output, every run.
SpecLoom's Deterministic Context Compiler (DCC) makes the bundle a pure function of its inputs. Same workspace + same task ID + same budget ⇒ byte-identical bundle, with a SHA-256 content hash in the header:
<!-- specloom bundle hash=da3f0a7ce907 -->
<!-- manifest {"task":"TASK-001","persona":"engineer","blocks":[...],"est_tokens":367,...} -->

This isn't an accident of implementation. It's enforced by three rules: file walking is sorted, graph traversal order is total (by tier, then lexicographic ID), and budgeting is a pure function. The test suite literally asserts two compiles of the same task are byte-identical.
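The ordering rule can be illustrated with a short sketch. It assumes each block carries a numeric tier and a unique string ID; compareBlocks, renderBundle, the CONST-001 ID, and the rendered layout are all hypothetical, and the hashing step is left out. The point is only that a total order makes the output independent of discovery order.

```javascript
// Hypothetical sketch: a total order (tier, then ID) feeding a serializer.
// Not SpecLoom's real bundle format; hashing is omitted here.

function compareBlocks(a, b) {
  if (a.tier !== b.tier) return a.tier - b.tier;
  // Plain code-unit comparison: unlike localeCompare, it cannot vary by locale.
  return a.id < b.id ? -1 : a.id > b.id ? 1 : 0;
}

function renderBundle(blocks) {
  return [...blocks] // copy so the caller's array is not mutated
    .sort(compareBlocks)
    .map((b) => `## ${b.id} (tier ${b.tier})\n${b.body}`)
    .join("\n\n");
}

const discovered = [
  { id: "TASK-001", tier: 1, body: "Implement signup." },
  { id: "CONST-001", tier: 0, body: "Never log credentials." },
  { id: "REQ-001", tier: 1, body: "Email+password signup." },
];

const first = renderBundle(discovered);
const second = renderBundle([...discovered].reverse());
console.log(first === second); // true: discovery order does not affect output
```

Because IDs are unique, no two blocks ever compare equal, so the sort has exactly one valid result and any content hash computed over the rendered string would be stable too.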
Why care? Because that hash goes into your commit messages and CI logs. When an agent produces output you don't understand three weeks later, you can trace it back to the exact context that produced it. Reproducibility stops being folklore.
Drift becomes a build failure
Specs rot the moment code evolves, and no mainstream SDD framework can answer "is this spec still true?" mechanically. SpecLoom can, because implemented code carries an anchor comment back to the block it satisfies:
// @spec:REQ-001#40d05b9f
export function signup(email, password) { /* ... */ }

The hash is the spec block's content hash at implementation time. specloom verify scans your source tree for these anchors and reports three drift states. It is a CI gate, so it exits non-zero on any finding:
| Kind | Meaning |
|---|---|
| STALE | the anchor's hash no longer matches the block — the spec changed after the code was written |
| ORPHAN | the anchor points to a block ID that no longer exists |
| UNBOUND | an approved requirement or task has zero implementing anchors — spec'd but not built |
$ specloom verify
STALE REQ-001 src/auth.js:42 40d05b9f
[specloom] 1 finding(s)

That STALE line means someone edited the requirement and the code didn't follow. Before, that divergence was invisible until a user found it. Now it fails the build. The cost is one short comment per implementing unit, which is cheap insurance for anyone who's lived through spec rot.
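The three drift states reduce to a small classification routine. The sketch below is an illustration under stated assumptions, not SpecLoom's implementation: findDrift is a hypothetical name, block hashes are supplied as precomputed strings, IDs are assumed to look like REQ-001, and source files are passed in as an in-memory map rather than read from disk.

```javascript
// Hypothetical sketch of drift classification over @spec anchors.
const ANCHOR = /@spec:([A-Z]+-\d+)#([0-9a-f]+)/g;

// blocks: Map of id -> { hash, type, status }
// files:  Map of path -> source text
function findDrift(blocks, files) {
  const findings = [];
  const anchored = new Set();
  // Visit files in sorted path order so the report is reproducible.
  const entries = [...files].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  for (const [path, source] of entries) {
    source.split("\n").forEach((line, index) => {
      for (const [, id, hash] of line.matchAll(ANCHOR)) {
        const where = `${path}:${index + 1}`;
        const block = blocks.get(id);
        if (!block) {
          findings.push({ kind: "ORPHAN", id, where }); // block no longer exists
        } else {
          anchored.add(id);
          if (block.hash !== hash) findings.push({ kind: "STALE", id, where }); // spec changed
        }
      }
    });
  }
  // UNBOUND: approved requirements or tasks with no anchor anywhere.
  for (const [id, block] of blocks) {
    const tracked = block.type === "requirement" || block.type === "task";
    if (tracked && block.status === "approved" && !anchored.has(id)) {
      findings.push({ kind: "UNBOUND", id, where: "" });
    }
  }
  return findings;
}
```

A CI wrapper around this would exit non-zero whenever the returned array is non-empty, which is the behavior the post describes for specloom verify.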
The pipeline and the personas
SpecLoom ships ten personas mapped onto a seven-stage pipeline. Each persona is not a 2,000-token biography — it's a ~90-token contract with exactly five lines: ROLE / DO / DON'T / OUT (output schema) / GATE (what to reject). The full set of ten costs under 1,000 tokens combined — less than one BMAD persona.
The crucial part: the engine enforces the gates, not the model. A block with status=draft simply cannot be compiled into an implementation bundle — only status=approved blocks can. The persona's GATE line tells the agent what to reject; the engine makes skipping the gate impossible. Role discipline stops being a suggestion.
And when the agent hits missing information, it doesn't guess. Every bundle's operating rules carry the NEED protocol: if information is missing, output exactly NEED: <block-id or question> and stop. Inventing a requirement is a protocol violation. The most expensive tokens are the wrong ones, and speculative generation is where wrong tokens come from.
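Both rules are mechanical enough to sketch in a few lines. This is an illustration only: assertApproved and parseNeed are hypothetical names, and it assumes a harness that sees each block's status and the agent's raw reply as a string.

```javascript
// Hypothetical sketch: engine-side checks for the approval gate and NEED replies.

// Refuse to build an implementation bundle from anything not yet approved.
function assertApproved(blocks) {
  const unapproved = blocks.filter((b) => b.status !== "approved").map((b) => b.id);
  if (unapproved.length > 0) {
    throw new Error(`cannot compile unapproved blocks: ${unapproved.join(", ")}`);
  }
  return blocks;
}

// Return the requested block ID or question if the reply is exactly a
// single "NEED: ..." line, otherwise null.
function parseNeed(reply) {
  const match = /^NEED:[ \t]*(\S.*)$/.exec(reply.trim());
  return match ? match[1] : null;
}
```

With checks like these, a draft block fails before any tokens are spent, and a NEED reply can be routed back to a human instead of being treated as finished work.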
Autopilot with a human hand on every gate
The pipeline above isn't just documentation — it's a deterministic state machine in the engine. specloom flow owns which stage a feature is on, runs each stage's mechanical gate, and advances only on explicit human approval. The developer's entire job becomes review and approval; everything else is the agent working from compiled bundles.
specloom flow start "users can reset their password via an email link"
# then your agent loops:
specloom flow status # current stage, persona(s), the exact block types to produce, the gate
specloom flow next # the precise instruction for this stage
# …agent generates that stage's blocks (or, at build, code + tests + @spec anchors)…
specloom flow approve # YOU run this after review; the gate runs mechanically and advances on pass

The gates aren't vibes. Spec stages require approved blocks of the expected types in that feature's own folder, so one feature's artifacts can never vacuously satisfy another's gate and concurrent features stay honest. The build and verify stages run full drift detection: if any anchor is STALE or any approved requirement is UNBOUND, flow approve exits non-zero and names exactly what's missing. And because a draft block cannot be compiled into a build bundle, the agent physically cannot skip ahead: the discipline is enforced by the engine, not requested by the prompt.
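The "advance only on pass" behavior can be sketched as a tiny state machine. The stage names are the seven listed in this post's FAQ; everything else (createFlow, approve, the gate callback returning a list of problems) is a hypothetical shape chosen for illustration, not the engine's real interface.

```javascript
// Hypothetical sketch of a gated stage machine.
const STAGES = ["discover", "specify", "design", "plan", "build", "verify", "ship"];

function createFlow(requirement) {
  return { requirement, stageIndex: 0 };
}

// gate: function (stage) -> array of problem strings; empty means pass.
function approve(flow, gate) {
  const stage = STAGES[flow.stageIndex];
  const problems = gate(stage);
  if (problems.length > 0) {
    // Gate failed: stay on the same stage and report what is missing.
    return { ok: false, flow, problems };
  }
  // Gate passed: move forward one stage, stopping at the last one.
  const next = Math.min(flow.stageIndex + 1, STAGES.length - 1);
  return { ok: true, flow: { ...flow, stageIndex: next }, problems: [] };
}
```

The agent never calls approve in this model; only the human-triggered command does, which is what keeps the stage pointer out of the model's hands.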
The orchestration is also tool-agnostic. specloom command all writes the same specloom-ship playbook into each tool's native command location — Claude Code (.claude/commands/), Cursor (.cursor/commands/), Copilot (.github/prompts/), Windsurf (.windsurf/workflows/) — plus a universal SPECLOOM_SHIP.md any agent can follow. Whichever assistant your teammates prefer, they invoke the same command and get the same gated pipeline.
Tiered degradation, and the refusal to truncate
A serious product's spec corpus won't fit in any context window — and shouldn't need to, because bundles are per-task. But individual bundles can still get big, so blocks carry a tier: 0 = invariants (never dropped), 1 = task-core (never dropped), 2 = supporting (can be summarized), 3 = background (can be dropped).
When a bundle exceeds budget, the compiler degrades in order: drop tier-3 first, then summarize tier-2 bodies — and summarizing preserves every acceptance line verbatim while compressing prose, because acceptance criteria are the one thing you can never lose. If after all that it still won't fit, the compiler does the one thing every other tool refuses to do: it throws an error instead of silently truncating.
SpecLoom refuses to silently truncate tier 0-1 context: bundle is 9,120 est. tokens
against a budget of 8,000. Split the task into smaller TASK-* blocks.

Silent truncation is the root cause of confident hallucination: the model fills the gap with something plausible and wrong. Loud failure forces the correct fix, which is splitting the task anyway.
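The degradation order can be sketched as follows. This is a simplified illustration, not the compiler's code: fitToBudget and summarize are hypothetical names, the token estimate uses a chars/4 proxy, acceptance lines are assumed to look like "- A1: ...", and the summarizer is a crude stand-in that keeps the first line plus every acceptance line.

```javascript
// Hypothetical sketch of tiered budget degradation.
const estimateTokens = (text) => Math.ceil(text.length / 4); // chars/4 proxy

// Crude stand-in summarizer: keep the first line and every acceptance line verbatim.
function summarize(body) {
  const lines = body.split("\n");
  const acceptance = lines.filter((l) => /^- A\d+:/.test(l));
  return [lines[0], ...acceptance.filter((l) => l !== lines[0])].join("\n");
}

function fitToBudget(blocks, budget) {
  const total = (list) => list.reduce((sum, b) => sum + estimateTokens(b.body), 0);
  let kept = blocks;
  let degraded = false;
  if (total(kept) > budget) {
    kept = kept.filter((b) => b.tier !== 3); // step 1: drop background blocks
    degraded = true;
  }
  if (total(kept) > budget) {
    // step 2: compress supporting blocks, never tier 0 or tier 1
    kept = kept.map((b) => (b.tier === 2 ? { ...b, body: summarize(b.body) } : b));
  }
  if (total(kept) > budget) {
    // step 3: fail loudly rather than cut protected context
    throw new Error(
      `refusing to truncate tier 0-1 context: ${total(kept)} est. tokens against a budget of ${budget}`
    );
  }
  return { blocks: kept, estTokens: total(kept), degraded };
}
```

Note that every step is a pure function of the block list and the budget, which is what lets budgeting stay compatible with byte-identical output.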
How it compares
| | GitHub Spec Kit | BMAD-METHOD | SpecLoom |
|---|---|---|---|
| Spec form | prose docs | prose docs + templates | typed, ID'd, hashed blocks |
| Context assembly | agent reads files | agent reads files | deterministic compiler |
| Token posture | heavy | very heavy | budgeted, ~10–30× lighter |
| Drift detection | none | none | hash anchors + CI gate |
| Personas | none | large always-loaded prompts | ~90-token contracts + engine gates |
| Multi-tool | scripts per tool | partial | adapter emitter (5 targets) |
| Determinism | no | no | byte-identical bundles, hashed |
One source of truth — the .loom workspace — and specloom adapt all emits native protocol files for Claude Code, Cursor, Copilot, Windsurf, and the generic AGENTS.md. Every tool learns the same protocol; you're not locked into one.
What to do Monday morning
You don't have to adopt the whole pipeline to get value. Try the 60-second loop on one real task:
npx @ruchit07/specloom init # scaffold specs/ with a worked example
npx @ruchit07/specloom adapt all # emit the ~250-token protocol files for your tools
# convert one feature's spec into typed blocks in specs/*.loom, then:
npx @ruchit07/specloom compile --task TASK-001 --persona engineer | pbcopy
# paste that bundle into your agent — that's its entire context
# after the agent implements and anchors the code:
npx @ruchit07/specloom verify # confirm zero drift before you call it done

And when you're ready for the full pipeline, add the orchestrator: npx @ruchit07/specloom command all installs the specloom-ship command for every coding agent, and specloom flow start "<requirement>" kicks off the seven-stage walk where your only job is flow approve at each gate.
Start with the drift gate even if you adopt nothing else: take one requirement you've already built, run specloom hash REQ-001, paste the @spec:REQ-001#<hash> anchor onto the function that implements it, and add specloom verify to CI. You now have a mechanical answer to "is this spec still true?" — which is more than Spec Kit or BMAD can give you.
Key takeaways
- The LLM should not be the runtime for non-intelligent work. Parsing, context assembly, gate-checking, and drift detection are deterministic — run them as an engine at zero token cost.
- Compile, don't read. Ship the dependency closure of one task (~370 tokens), not the docs folder (20–60k). Reference by ID; inline once per bundle.
- Determinism is a feature you can commit. Byte-identical bundles with a content hash make agent output traceable and reproducible.
- Anchors make drift a build failure. @spec:ID#hash + a verify gate catches STALE, ORPHAN, and UNBOUND before users do.
- Gates belong to the engine, not the prompt. Only status=approved blocks compile; NEED-and-stop replaces guessing.
- Refuse to truncate. Loud failure beats confident hallucination every time.
Your next step
Clone the repo, read ANALYSIS.md for the full eight-challenge breakdown, and run npx @ruchit07/specloom init on a throwaway directory to see the worked example compile. If you're not yet sold on the whole framework, start one level up with the spec-first workflow — SpecLoom is what that method looks like when you make the engine, not the model, responsible for the discipline.
SpecLoom on GitHub: github.com/ruchit07/specloom
Frequently asked questions
What is the difference between SpecLoom and GitHub Spec Kit?
Spec Kit gives you a staged pipeline (constitution → specify → plan → tasks → implement) with free-form markdown artifacts that the agent reads and interprets on every step. SpecLoom keeps the staged idea but replaces prose with typed, ID'd blocks and replaces "agent reads files" with a deterministic compiler that emits only the dependency closure for one task. The practical differences: SpecLoom bundles are ~10–30× smaller, byte-identical across runs, and carry hash anchors so spec/code drift is a CI failure — none of which Spec Kit provides.
How does SpecLoom reduce token usage by 10–30×?
The savings come from four things, in priority order. First, closure not corpus: a per-task bundle ships 5–15 blocks instead of whole documents, so it's a few hundred to a few thousand tokens regardless of corpus size. Second, rules files carry only the ~250-token protocol, never project knowledge, since they're paid for on every request. Third, blocks reference each other by ID and the compiler inlines a shared definition exactly once. Fourth, persona contracts are ~90 tokens each instead of multi-thousand-token role prompts. A task that costs ~30k input tokens per turn under document-reading SDD drops to ~1–3k.
What are @spec hash anchors and how do they detect drift?
An anchor is a comment in your code like @spec:REQ-001#40d05b9f where the hex suffix is the SHA-256 content hash of the spec block at the time the code was written. specloom verify recomputes each block's current hash and compares. If they differ, the spec changed after implementation and the anchor is reported STALE. If the block ID no longer exists, it's ORPHAN. If an approved requirement or task has no anchor anywhere, it's UNBOUND. The command exits non-zero on any finding, so it works as a CI gate.
Does SpecLoom lock me into one AI coding tool?
No. The .loom workspace is the single source of truth, and specloom adapt all emits native protocol files for five targets: CLAUDE.md (Claude Code), AGENTS.md (generic), .cursor/rules/specloom.mdc, .github/copilot-instructions.md, and .windsurf/rules/specloom.md. Each file carries the same ~250-token protocol, so whichever tool a teammate uses learns the same workflow — compile a bundle, work only from it, anchor the code, run verify.
Why does SpecLoom throw an error instead of truncating context to fit a budget?
Because silent truncation is the root cause of confident hallucination — the model fills the missing context with something plausible and wrong, and you have no signal that it happened. SpecLoom degrades gracefully first (drops tier-3 background, then summarizes tier-2 support while preserving every acceptance line verbatim), but it never touches tier-0 invariants or tier-1 task-core. If the protected core still doesn't fit the budget, the compile fails loudly. The correct response is to split the task into smaller units, which is the right design move regardless.
Can SpecLoom automate the whole SDLC so the developer only reviews and approves?
Yes — that's what specloom flow and the specloom-ship command do together. A deterministic state machine owns the seven stages (discover → specify → design → plan → build → verify → ship); your coding agent does the generative work for each stage from compiled bundles, then stops and presents the output. The developer's only actions are reviewing each stage's blocks, marking them approved, and running specloom flow approve — which executes the stage's mechanical gate (approved artifacts of the right types for spec stages, zero drift findings for build and verify) and advances only on pass. The agent cannot advance the pipeline itself, and draft blocks cannot compile into build bundles, so no stage can be skipped.
Is SpecLoom production-ready, and what does it cost to run?
The core engine — deterministic compile, budget enforcement, drift detection, multi-tool adapters — was prototyped and tested end-to-end before release, and the package ships with a deterministic test suite you can run with node test/run.js. It has zero runtime dependencies (Node ≥ 18, plain ESM), so there's nothing to install beyond the package itself, and the engine runs locally at zero token cost — you only spend tokens on the genuinely generative steps. Token estimation uses a deterministic chars/4 proxy rather than a real tokenizer; set budgets ~15% under your true target to stay safe.
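As a worked example of that advice, the arithmetic looks like this. Both function names are hypothetical; the only things taken from the text are the chars/4 proxy and the roughly 15% safety margin.

```javascript
// Hypothetical helpers illustrating the chars/4 proxy and a safety margin.

// Estimate tokens as one per four characters, rounded up.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Shrink a true token target by a percentage to absorb estimator error.
// Integer arithmetic avoids floating-point surprises.
function budgetWithMargin(trueTarget, marginPercent) {
  return Math.floor((trueTarget * (100 - marginPercent)) / 100);
}

console.log(budgetWithMargin(8000, 15)); // 6800
```

So if the real ceiling you care about is 8,000 tokens, compiling against a budget of about 6,800 leaves headroom for the cases where a real tokenizer counts more tokens than the proxy does.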



