Leading Spec-First, Agent-Assisted Delivery

Make spec-first the default over vibe coding, give agents real context with MCP, and govern autonomous PRs by blast radius and review capacity.

Article 4 of 5 · 5 min · Advanced


Key Takeaway

Once your team has the right people (Lessons 2–3), the question is how they ship. The failure mode is "vibe coding" at scale (prompt, accept, repeat), which feels fast and accumulates invisible debt: systems whose design lives nowhere and that nobody can change confidently. The leadership move has two parts. First, make spec-first the default: the specification is the source of truth, agents implement against it with real context (MCP), and everything passes automated gates before human review focuses on intent. Second, govern autonomous PRs by blast radius and review capacity, so generation serves the team instead of burying it.

You've staffed for judgment and built a ladder that rewards it. Now your engineers are shipping with AI every day. Left ungoverned, that produces two predictable problems a leader has to get ahead of: invisible design debt from vibe coding, and review-queue collapse from unmanaged autonomous PRs. Both are leadership problems, because both are about defaults and limits, not individual skill.

Make spec-first the default, not vibe coding

Vibe coding — prompt, accept, repeat, no written design — is seductive because the feedback loop is tight and the dopamine is fast. But it accumulates debt: a hundred local decisions the agent made, recorded nowhere, that someone has to re-derive the moment a requirement changes. Fast to demo, brutal to maintain.

The inversion that fixes it: when implementation is cheap, the specification becomes the valuable, scarce artifact. Spec-first development treats a precise spec — intent, constraints, interfaces, acceptance criteria — as the source of truth, and the code as a regenerable build artifact. The spec is where judgment lives and what lets you change a requirement and regenerate cleanly instead of doing archaeology on code nobody can explain.

As a leader, your job is to make this the default path: specs live in the repo, reviewed like code; "what's the spec?" is a normal question in planning; and the design history lives in the spec's git log, not in someone's head. This isn't waterfall — specs are iterative and lightweight — it's just making the design exist somewhere a human and an agent can both read it.
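To make "specs live in the repo" concrete, here is a minimal sketch of a spec treated as structured data that a human and an agent can both read. The `Spec` class, its field names, and the example endpoint are hypothetical; they simply mirror the four parts named above (intent, constraints, interfaces, acceptance criteria) and are not a standard format.

```python
from dataclasses import dataclass, field


@dataclass
class Spec:
    """Hypothetical minimal spec record with the four parts of a spec."""
    intent: str
    constraints: list[str] = field(default_factory=list)
    interfaces: list[str] = field(default_factory=list)
    acceptance_criteria: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "intent": self.intent.strip(),
            "constraints": self.constraints,
            "interfaces": self.interfaces,
            "acceptance_criteria": self.acceptance_criteria,
        }
        return [name for name, value in sections.items() if not value]


def ready_for_agent(spec: Spec) -> bool:
    """Hand a spec to an agent only when every section is filled in."""
    return not spec.missing_sections()


# An idea that has not been specified yet.
draft = Spec(intent="Let users export invoices as CSV")

# The same idea with enough detail to implement against (illustrative values).
complete = Spec(
    intent="Let users export invoices as CSV",
    constraints=["Export must stream; never load the full result set in memory"],
    interfaces=["GET /invoices/export?from=&to="],
    acceptance_criteria=["Exported rows match the invoice list for the same date range"],
)
```

A check like `ready_for_agent` is one way to make "what's the spec?" a routine question: the draft above is rejected until its constraints, interfaces, and acceptance criteria are written down.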

Give agents real context (MCP), not guesses

A leading reason agents produce confident-but-wrong code is that they're guessing about your system. The fix is giving them ground truth (your schema, your docs, your conventions) through MCP (Model Context Protocol) servers, rather than relying on copy-pasted context. As a leader, treat "wire up the context our agents keep getting wrong" as infrastructure work worth prioritizing; it improves output quality more reliably than waiting for a better model. Keep those servers read-only and least-privilege: the tools an MCP server exposes define the blast radius of a prompt injection.
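As an illustration of read-only, least-privilege context, the sketch below shows the kind of function a context server might expose as a tool: it returns the real schema but cannot modify anything. It assumes a SQLite database purely so the example is self-contained; `open_read_only` and `describe_schema` are hypothetical names, and the code that registers the function with an actual MCP server is deliberately left out.

```python
import sqlite3


def open_read_only(db_path: str) -> sqlite3.Connection:
    """Open the database so that any write on this connection is rejected."""
    # mode=ro in a SQLite URI filename opens the file read-only.
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)


def describe_schema(db_path: str) -> dict[str, str]:
    """Return {table_name: CREATE TABLE statement} as ground truth for an agent."""
    conn = open_read_only(db_path)
    try:
        rows = conn.execute(
            "SELECT name, sql FROM sqlite_master "
            "WHERE type = 'table' ORDER BY name"
        ).fetchall()
        return {name: sql for name, sql in rows}
    finally:
        conn.close()
```

The design choice worth noting is that the restriction lives in the connection itself, not in a prompt instruction: even if injected text persuades the agent to attempt a write through this tool, the database refuses it.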

Gates before humans; humans review intent

With more code flowing, you must protect your scarce review capacity (Lesson 1). The rule: generated code passes automated gates — tests, types, lint, security scanning — before a human looks. Then human review focuses on intent and correctness ("does this do the right thing and can we live with it?"), not syntax the machine already checked. This keeps senior judgment aimed at the part only humans can do.
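The "gates before humans" rule can be sketched as a small runner that requests human review only when every automated check exits cleanly. This is an illustration under stated assumptions, not a CI system: `run_gates`, `ready_for_human_review`, and the placeholder commands are hypothetical, and a real pipeline would substitute its own test, type-check, lint, and security-scan commands.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class GateResult:
    name: str
    passed: bool


def run_gates(gates: dict[str, list[str]]) -> list[GateResult]:
    """Run each gate command; a gate passes when its exit code is 0."""
    results = []
    for name, command in gates.items():
        completed = subprocess.run(command, capture_output=True)
        results.append(GateResult(name, completed.returncode == 0))
    return results


def ready_for_human_review(results: list[GateResult]) -> bool:
    """Request review only if at least one gate ran and all of them passed."""
    return bool(results) and all(result.passed for result in results)


# Placeholder commands so the sketch runs anywhere Python is installed.
example_gates = {
    "tests": [sys.executable, "-c", "assert 1 + 1 == 2"],
    "lint": [sys.executable, "-c", "raise SystemExit(1)"],
}
```

With `example_gates`, the failing "lint" placeholder keeps the change out of the human queue, which is the point: reviewers never spend attention on something a machine could have rejected.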

Govern autonomous PRs by blast radius and review capacity

Autonomous PRs — agents that open complete PRs from an issue — are real leverage and a real way to drown your best people. Two governing rules:

Autonomy scales inversely with blast radius. Full autonomy for well-specified, low-risk, reversible work (dependency bumps, test coverage, mechanical refactors). A human owns the decision for auth, money, migrations, public interfaces, and architecture — places where confident-but-subtle errors are expensive.
You can only generate as many PRs as you can genuinely review. Forty agent PRs a week when you can review fifteen isn't 2.7x productivity; it's fifteen reviewed and twenty-five rotting or rubber-stamped. Throttle generation to review throughput, and raise the ceiling by improving gates and AI pre-review, not by ignoring the limit.
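The two rules above can be sketched as a small policy: one function decides who owns a change based on what it touches, and another caps weekly generation at review capacity. Everything named here is an assumption for illustration: the change kinds, the path patterns, and the conservative choice to treat anything not explicitly low-risk as human-owned.

```python
from enum import Enum
from fnmatch import fnmatch


class Autonomy(Enum):
    FULL = "full"
    HUMAN_OWNED = "human_owned"


# Hypothetical labels for well-specified, low-risk, reversible work.
LOW_RISK_KINDS = {"dependency_bump", "test_coverage", "mechanical_refactor"}

# Hypothetical path patterns for auth, money, migrations, and public interfaces.
HIGH_BLAST_RADIUS_PATHS = ["auth/*", "billing/*", "migrations/*", "api/public/*"]


def autonomy_for(kind: str, changed_paths: list[str]) -> Autonomy:
    """Grant full autonomy only to low-risk kinds that avoid high-risk paths."""
    touches_high_risk = any(
        fnmatch(path, pattern)
        for path in changed_paths
        for pattern in HIGH_BLAST_RADIUS_PATHS
    )
    if touches_high_risk or kind not in LOW_RISK_KINDS:
        return Autonomy.HUMAN_OWNED
    return Autonomy.FULL


def weekly_plan(requested_prs: int, review_capacity: int) -> tuple[int, int]:
    """Return (PRs to generate, PRs deferred) given deep-review capacity."""
    allowed = min(requested_prs, review_capacity)
    return allowed, requested_prs - allowed
```

For the numbers in the text, `weekly_plan(40, 15)` returns `(15, 25)`: fifteen PRs are generated and twenty-five wait until capacity exists, instead of being opened and left to rot.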

Measure latency and revert rate, not merge count. Stable review latency with rising volume means it's working; rising latency and reverts mean you're shipping debt and calling it speed.
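A minimal sketch of those two measurements, assuming you can export one record per merged PR with its time to review and whether it was later reverted. The `MergedPR` record and the `looks_like_debt` rule are hypothetical simplifications; a real dashboard would look at trends over more than two periods.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class MergedPR:
    hours_to_review: float
    reverted: bool


def review_latency_hours(prs: list[MergedPR]) -> float:
    """Median hours from PR opened to review completed."""
    return median(pr.hours_to_review for pr in prs)


def revert_rate(prs: list[MergedPR]) -> float:
    """Fraction of merged PRs that were later reverted."""
    return sum(pr.reverted for pr in prs) / len(prs)


def looks_like_debt(previous: list[MergedPR], current: list[MergedPR]) -> bool:
    """Flag a period where review latency and revert rate both rose."""
    return (
        review_latency_hours(current) > review_latency_hours(previous)
        and revert_rate(current) > revert_rate(previous)
    )
```

Note that merge count appears nowhere in the check: a week with more merged PRs only counts as progress if latency and reverts hold steady.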

Reflect before moving on

Ask of your team's last AI-built feature: if someone had to change one requirement tomorrow, where would they look to understand why the code is the way it is? If the answer is "read all the code and guess," your default is vibe coding — and the fix is to make spec-first the path of least resistance. Then count how many PRs your team can deeply review per week; that number is the ceiling on safe autonomous-PR volume.

→ Go deeper in the companion essays: AI-Driven Development: The Spec-First Workflow and Autonomous PRs: Letting Agents Open, Review, and Merge — Safely. Finally: protecting the standard that makes all of this worth shipping.
