AI-Driven Development: The Spec-First Workflow That Makes Agents Actually Useful

Vibe coding — prompt, accept, repeat — produces fast demos and slow disasters. The senior move is spec-first development: invest in a precise specification, let agents implement against it with MCP for real context, and gate everything behind tests, types, and human review of intent. The four-phase loop, why the spec becomes the asset when code is cheap, where autonomous PRs actually fit, and the failure modes (context rot, confident wrongness, review debt) that bite teams who skip the discipline.

Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.

14 min read
Key Takeaway

"Vibe coding" — prompt, accept, repeat — produces fast demos and slow disasters. The senior move in 2026 is the opposite: spec-first development, where you invest in a precise specification and let agents implement against it, with MCP for context and tight review gates for safety. This is the workflow I use to ship agent-built code I'm willing to put my name on: write the spec as the source of truth, wire up MCP so the agent has real context instead of guesses, let it implement in a bounded loop, and gate everything behind tests, types, and human review of *intent*. Plus where autonomous PRs actually fit, and the failure modes (context rot, confident wrongness, review debt) that bite teams who skip the discipline.

February 2026. I'm pairing with a strong mid-level engineer who's become genuinely fast with AI. He's flying — prompt, tab, accept, prompt, tab, accept. Forty minutes in, we have a working feature. It demos perfectly.

Then we try to change one requirement. And the whole thing unravels. Nobody, not him and not the AI, has a clear model of why the code is the way it is. The agent made a hundred local decisions, each reasonable, none written down. There's no spec, so there's no source of truth except the code itself, and the code is now a 600-line argument nobody can follow. Changing the requirement means re-deriving every one of those hundred decisions. We spend two hours undoing forty minutes of "productivity."

This is the trap of AI-driven development as most people practice it. It feels incredibly productive because the feedback loop is so tight and the dopamine is so fast. But you're accumulating an invisible debt: a system whose design lives nowhere, owned by no one, understood by no one. I call it vibe coding, and it's fine for prototypes and poison for anything you have to maintain.

The senior engineers getting durable leverage from AI are doing almost the opposite. They've figured out that when implementation is cheap, the specification becomes the valuable, scarce artifact — and they've reorganized their entire workflow around that fact. This is that workflow.

The inversion: when code is cheap, the spec is the asset

For my whole career, the spec was the disposable part and the code was the asset. You'd sketch a rough plan, then spend most of your time writing the code, and the code became the truth. Specs went stale because keeping them current cost more than they returned.

AI inverts this. When an agent can generate a correct implementation from a clear specification in minutes, the economics flip:

If the implementation is cheap and regenerable, then the precise, reviewed, version-controlled specification is what holds the value. It's the thing that's hard to get right, the thing that encodes judgment, the thing that lets you change a requirement and regenerate cleanly instead of re-deriving a hundred buried decisions. Spec-first development means treating the spec as the primary source of truth and the code as a build artifact of it.

This isn't waterfall. The spec is living and iterative — you refine it in tight loops just like you'd refine code. The difference is that the design exists somewhere a human and an agent can both read it, instead of being implicit in code nobody can explain.

The spec-first loop

Here's the actual workflow. Four phases, and the discipline is in the gates between them.

Phase 1: Write the spec (this is the work now)

The spec is where your engineering judgment lives. It doesn't have to be heavy — it has to be precise about the things that matter:

  • Intent — what problem this solves and why, in plain language. The "why" is what lets the agent (and the next human) make good local decisions.
  • Constraints — performance budgets, security requirements, what it must not do, which patterns to follow, which to avoid.
  • Interfaces — the public surface: function signatures, API shapes, data contracts. Pin these down; they're where integration breaks.
  • Acceptance criteria — concrete, testable conditions for "done." These become your tests and your eval of whether the agent succeeded.

A good spec is dramatically shorter than the code it produces, which is exactly why it's worth the investment. I keep specs in the repo next to the code, in version control, reviewed like code. When requirements change, you change the spec and regenerate — the design history lives in the spec's git log, not in archaeology.
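To make the four parts concrete, here is a minimal sketch of a spec whose acceptance criteria are executable. Everything in it is an assumption for illustration: the `Spec` and `AcceptanceCriterion` classes, the `slugify` example, and the field names are hypothetical, not a standard spec format. In practice a spec is often a plain document in the repo; the point is only that acceptance criteria can be written so a machine can check them.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AcceptanceCriterion:
    description: str
    # Takes a candidate implementation and returns True if it passes.
    check: Callable[[Callable[[str], str]], bool]


@dataclass
class Spec:
    intent: str             # what problem this solves and why
    constraints: list[str]  # budgets, rules, things it must not do
    interface: str          # the pinned public surface
    acceptance: list[AcceptanceCriterion] = field(default_factory=list)

    def evaluate(self, impl: Callable[[str], str]) -> list[str]:
        """Return the description of every criterion the implementation fails."""
        return [c.description for c in self.acceptance if not c.check(impl)]


# Hypothetical spec for a small helper function.
slug_spec = Spec(
    intent="Turn article titles into URL-safe slugs so links stay readable.",
    constraints=["ASCII output only", "No external dependencies"],
    interface="def slugify(title: str) -> str",
    acceptance=[
        AcceptanceCriterion("lowercases and hyphenates words",
                            lambda f: f("Hello World") == "hello-world"),
        AcceptanceCriterion("strips punctuation",
                            lambda f: f("Spec-First, Always!") == "spec-first-always"),
        AcceptanceCriterion("falls back to 'untitled' for blank input",
                            lambda f: f("   ") == "untitled"),
    ],
)


def slugify(title: str) -> str:
    """A candidate implementation, standing in for what an agent might produce."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return "-".join(words) or "untitled"


failures = slug_spec.evaluate(slugify)  # an empty list means every criterion passed
```

Because the criteria live in the spec rather than in the implementation, the same `evaluate` call works unchanged on a regenerated implementation.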

Phase 2: Let the agent implement — with real context via MCP

The single biggest reason agents produce wrong-but-plausible code is that they're guessing about your system. They don't know your actual database schema, your internal conventions, your existing utilities, your ticket history. So they hallucinate a reasonable-sounding version and you get code that works against an imaginary codebase.

MCP (Model Context Protocol) is the fix for this, and it's the most underused lever in AI-driven development right now. MCP is an open standard that lets an agent pull context from real sources — your database schema, your docs, your issue tracker, your filesystem, internal APIs — through a uniform interface, instead of you copy-pasting context into a chat window or hoping the model guesses right.

The practical effect: the agent implements against your real schema and conventions instead of a plausible fiction. This is the difference between an agent that needs everything spoon-fed and one that can pull what it needs. If you take one new thing from this article, it's: wire up MCP servers for the context your agents keep getting wrong. It moves quality more than upgrading the model.
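As a rough illustration of "real context instead of guesses", the sketch below reads table and column definitions from a live database. It deliberately does not use any MCP SDK: `describe_schema` is a hypothetical helper, the in-memory SQLite database stands in for a real one, and the function is only the kind of thing an MCP server could expose to an agent.

```python
import sqlite3


def describe_schema(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Return {table_name: ["column TYPE", ...]} read from the live database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [f"{col[1]} {col[2]}" for col in columns]
    return schema


# Hypothetical in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total_cents INTEGER)"
)

schema = describe_schema(conn)
```

An agent that can call something like this sees `total_cents INTEGER` instead of inventing a `total` float column, which is exactly the class of wrong guess described above.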

A note on scope: keep the agent's loop bounded. Let it implement against the spec, run its own checks, and iterate — but within the boundaries the spec defined. An agent given an unbounded goal will wander; an agent given a precise spec and good context converges.

Phase 3: Automated gates (non-negotiable)

The agent's output goes through the same gates you'd put on any code, and these matter more with AI because the volume is higher and the author can't fully explain it:

  • Tests derived from your acceptance criteria — the spec told you what to test.
  • Type checking — types are a cheap, machine-checkable contract that catches a whole class of confident AI mistakes.
  • Lint and formatting — consistency at zero human cost.
  • Security and dependency scanning — agents will happily add a vulnerable package or write an injectable query.

These run before a human spends a second on review. Generated code that can't pass automated gates shouldn't reach a person. This is how you keep review capacity — the real bottleneck on AI-augmented teams — focused on judgment instead of catching missing semicolons.
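The "gates before humans" idea can be sketched as a small runner that executes each check in order and stops at the first failure. This is an assumption-heavy sketch: `run_gates` is a hypothetical helper, and the commands are stand-ins that simply exit with 0 or 1. Your actual test, type-check, lint, and scan commands depend on your stack and are not shown.

```python
import subprocess
import sys


def run_gates(gates: list[tuple[str, list[str]]]) -> tuple[bool, list[str]]:
    """Run each gate command in order and stop at the first failure.

    Returns (all_passed, log), with one log line per gate that actually ran.
    """
    log = []
    for name, command in gates:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            log.append(f"FAIL {name}")
            return False, log  # later gates are skipped; nothing reaches a reviewer
        log.append(f"PASS {name}")
    return True, log


# Stand-in commands so the sketch runs anywhere Python does.
ok = [sys.executable, "-c", "raise SystemExit(0)"]
bad = [sys.executable, "-c", "raise SystemExit(1)"]

passed, log = run_gates([("tests", ok), ("types", bad), ("lint", ok)])
```

Here the type gate fails, so lint never runs and the change would be sent back to the agent rather than to a person.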

Phase 4: Human review of intent, not syntax

This is the step everyone wants to skip and absolutely cannot. The automated gates prove the code runs and is internally consistent. They cannot prove it does the right thing. That's the human's job, and the spec makes it tractable: you're reviewing code against the spec's intent, not reverse-engineering what the code is trying to do.

The question is never "is this syntactically fine" — the gates answered that. It's "does this implement what we actually meant, and will we be able to live with it?" If the code and the spec disagree, you fix the spec (if your intent was wrong) or regenerate (if the agent missed). Either way the spec stays the source of truth.

Where autonomous PRs actually fit

The frontier of this in 2026 is autonomous PRs — you assign an agent an issue, it produces a complete pull request without a human in the implementation loop. Used well, this is real leverage. Used carelessly, it's a firehose of review debt aimed at your most scarce people.

The honest framing: autonomy should scale inversely with blast radius. The more damage a wrong change can do, the less autonomy the agent gets.

Great autonomous-PR candidates: well-specified bug fixes, dependency bumps with passing tests, mechanical refactors, adding test coverage, boilerplate that follows an established pattern. Bad candidates: anything touching auth, money, data migrations, public interfaces, or architecture — the places where confident wrongness is expensive and a spec-with-human-in-the-loop is worth the time.
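One way to encode that split is a routing rule over backlog tags. The tag names and the `route` function below are hypothetical, chosen to paraphrase the candidates listed above; the only design choice worth noting is that a high-blast-radius tag always overrides everything else.

```python
# Hypothetical tag names, paraphrasing the good and bad candidates described above.
HIGH_BLAST_RADIUS = {"auth", "payments", "data-migration", "public-interface", "architecture"}
AUTONOMOUS_OK = {"bug-fix", "dependency-bump", "mechanical-refactor", "test-coverage", "boilerplate"}


def route(tags: set[str], well_specified: bool) -> str:
    """Decide whether a backlog item may go to an autonomous agent."""
    if tags & HIGH_BLAST_RADIUS:
        return "human-in-the-loop"  # high blast radius always wins
    if well_specified and tags & AUTONOMOUS_OK:
        return "autonomous"
    return "human-in-the-loop"      # default to caution when unsure
```

A bug fix that touches auth is routed to a human even though "bug-fix" alone would qualify, and an unspecified bug fix is held back until someone writes the spec.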

And the rule that keeps autonomous PRs from eating your team: you can only generate as many PRs as you can actually review. Autonomy without review capacity isn't speed, it's a queue of half-trusted code that someone merges under pressure on a Friday. Throttle generation to match review throughput.
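The throttle itself is simple arithmetic. A minimal sketch, assuming you can count daily review capacity and the current unreviewed queue (the function and parameter names are hypothetical):

```python
def prs_to_generate(review_capacity_per_day: int, open_unreviewed: int, candidates: int) -> int:
    """How many new autonomous PRs to start today without outrunning reviewers."""
    headroom = max(0, review_capacity_per_day - open_unreviewed)
    return min(candidates, headroom)
```

With capacity for 5 reviews a day and 2 PRs already waiting, at most 3 new PRs get generated no matter how many candidates are in the backlog. If the queue already exceeds capacity, the answer is 0.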

The failure modes to design against

I've watched these bite every team that adopts AI-driven development without the discipline:

  • Context rot. In long agent sessions, early context gets crowded out and the agent starts contradicting decisions it made an hour ago. Fix: keep sessions scoped to a spec, start fresh for new work, and let the spec — not the chat history — be the memory.
  • Confident wrongness. AI states wrong things in the exact tone it states right things. There is no uncertainty signal in the output. Your only defense is verification: tests, types, and a reviewer who's suspicious in the right places.
  • Review debt. Generation outpaces review, half-validated code accumulates, and quality silently erodes. Fix: treat review capacity as the constraint and throttle to it (see above).
  • Skill atrophy and the broken pipeline. If juniors only accept AI output, they never build judgment. Structure their work around reviewing and defending generated code, not just producing it — the team-topology piece goes deep on this.
  • The disappearing "why." Vibe-coded systems have no design record. Spec-first fixes this by construction: the spec is the why, in version control.

What to do Monday morning

  1. Pick your next feature and write the spec first. Intent, constraints, interfaces, acceptance criteria — one page is often enough. Notice how the act of writing it surfaces decisions you'd otherwise have buried in code.

  2. Wire up one MCP server for the context your agents most often get wrong, usually your database schema or your internal docs. Track how often "the AI guessed our structure wrong" comes up before and after. In my experience the drop is dramatic.

  3. Put gates before humans. Make sure generated code hits tests, types, lint, and security scanning before it reaches a reviewer. If it doesn't, that's your first fix — you're spending senior judgment on things a machine should catch.

  4. Classify your backlog by blast radius. Tag which items are safe autonomous-PR candidates and which need a human in the design loop. Start autonomy only where it's reversible and well-specified, and only as fast as you can review.

Key takeaways

  • Vibe coding feels productive and accumulates invisible debt. Prompt-accept-repeat with no spec produces systems whose design lives nowhere and that nobody can change confidently. Fast to demo, brutal to maintain.

  • When code is cheap, the spec is the asset. Spec-first development treats a precise, version-controlled specification as the source of truth and the code as a regenerable build artifact. The spec is where judgment lives and what lets you change requirements without archaeology.

  • MCP is the most underused lever. Giving agents real context — schema, docs, issues, internal APIs — through MCP servers fixes confident-wrong code more effectively than upgrading the model. Wire it up for whatever your agents keep guessing wrong.

  • Gates before humans; humans review intent. Automated tests, types, lint, and security scanning catch the mechanical failures so scarce human review can focus on whether the code matches the spec's intent.

  • Autonomy scales inversely with blast radius and is capped by review capacity. Autonomous PRs are great for well-specified, low-risk, reversible work and dangerous for auth/money/migrations/interfaces. You can only generate as many PRs as you can truly review; beyond that, autonomy is debt.

Your next step

Take the last feature you built with AI and ask: if a teammate had to change one requirement tomorrow, where would they look to understand why the code is the way it is? If the answer is "they'd have to read all the code and guess," you've been vibe coding. Write the spec that should have existed — it'll take twenty minutes and it'll show you exactly what spec-first buys you.

The agents are good and getting better. The leverage isn't in letting them run free — it's in giving them a precise target, real context, and tight gates, then spending your judgment on the one thing they can't do: deciding what "correct" means. You write the spec. They write the code. You make the call.

Frequently asked questions

What is spec-first development with AI?

Spec-first development is a workflow where you write a precise specification — intent, constraints, interfaces, and acceptance criteria — as the primary source of truth, then have an AI agent implement against it, with the code treated as a regenerable build artifact. It's a response to the economics of AI coding: when implementation becomes cheap, the specification becomes the valuable, scarce artifact because it's where judgment lives and what lets you change requirements and regenerate cleanly. It is iterative and lightweight, not waterfall — the spec is refined in tight loops, but it exists somewhere both humans and agents can read it.

What is MCP and why does it matter for AI coding?

MCP (Model Context Protocol) is an open standard that lets AI agents pull context from real sources — database schemas, documentation, issue trackers, internal APIs, the filesystem — through a uniform interface, rather than relying on copy-pasted context or the model's guesses. It matters because the main reason agents produce plausible-but-wrong code is that they're guessing about your actual system. Wiring up MCP servers so the agent implements against your real schema and conventions improves output quality more reliably than upgrading to a bigger model.

Is "vibe coding" bad?

Vibe coding — prompting, accepting, and repeating with no specification — is fine for throwaway prototypes and exploration, but harmful for code you have to maintain. It feels highly productive because the feedback loop is tight, but it accumulates invisible debt: the system's design lives nowhere, so changing a requirement means re-deriving every buried decision the agent made. For durable work, a spec-first approach keeps the design explicit and version-controlled so the system stays changeable.

When should I use autonomous PRs from AI agents?

Use autonomous PRs for work that is well-specified, low blast radius, and reversible: bug fixes with clear acceptance criteria, dependency bumps with passing tests, mechanical refactors, adding test coverage, and pattern-following boilerplate. Avoid full autonomy for anything touching authentication, payments, data migrations, public interfaces, or architecture, where confident wrongness is expensive — those deserve a human in the design loop. The hard limit is review capacity: you can only safely generate as many PRs as you can genuinely review, so throttle autonomy to match review throughput.

How do I keep AI-generated code maintainable?

Make the design explicit and verifiable. Keep a version-controlled spec as the source of truth so the "why" behind the code is recorded rather than buried. Run automated gates (tests derived from acceptance criteria, type checking, lint, security scanning) before any human review, and have humans review the code against the spec's intent rather than its syntax. Guard against context rot by scoping agent sessions to a spec, and against skill atrophy by structuring engineers' work around reviewing and defending generated code, not just producing it.
