AI & Developer Productivity

Custom Copilot Agents: How I Automated 12 Hours of Architecture Work Per Week

Senior engineers waste hours typing the same Copilot prompts repeatedly. GitHub Copilot Agents (.agent.md files) let you encode expertise once, reuse forever. Built 4 production agents that coordinate through an orchestrator: reduced ADR creation from 2 hours to 15 minutes. Learn the Agent Maturity Model, 3-Gate Validation Framework, Agent Design Canvas, and orchestrator patterns. Real .agent.md files, metrics from 6 months of production use.

Ruchit Suthar
February 19, 2026 · 30 min read

TL;DR

Most senior engineers waste hours typing the same complex Copilot prompts repeatedly—same architecture doc patterns, same code review questions, same design templates. GitHub Copilot Agents (.agent.md files) let you encode your expertise once and reuse it forever. I built 4 production agents that coordinate through an orchestrator: reduced Architecture Decision Record (ADR) creation from 2 hours to 15 minutes while maintaining quality gates. Learn the Agent Maturity Model (4 levels from prompts to orchestration), 3-Gate Validation Framework (prevent garbage output), and the Agent Design Canvas (when to build vs when to prompt manually). Your expertise isn't in your head anymore—it's systematized and scaled.



December 2025. It's Friday afternoon, and I'm 2 hours into documenting an architecture decision. Not even a complex one—just choosing between PostgreSQL and MongoDB for a payment system. I'm halfway through the trade-off analysis, exhausted, and staring at my notes thinking, "I've done this exact analysis a dozen times. Why does every ADR take this long?"

I opened Copilot Chat. Typed out the context: "You're documenting an architecture decision, here are the requirements, constraints, and options..." Generated a comparison. Too generic. Edited it manually. Typed the context again for the consequences section. Got slightly better output. Spent more time editing than I would have writing from scratch.

This wasn't automation. This was trading one manual process for another.

Then I found .agent.md files buried in VS Code's documentation. Not just prompts—actual workflow automation. You define the agent's role, the process it follows, the validation gates it enforces, and it executes the entire workflow while you validate outputs at decision points.

I spent one weekend building my first agent. The next ADR? 15 minutes start to finish. Same quality, same analysis depth, but instead of typing prompts repeatedly, I ran a single command: @adr-generator decision="database choice" options="PostgreSQL,MongoDB", then validated outputs at 3 strategic gates.

Six months later, I've generated 40+ ADRs using this system. That's documentation that would have taken me 80 hours to create manually. Actual time invested? 10 hours. Saved: 70 hours.

If you're a senior engineer spending hours on repeatable tasks—architecture docs, code reviews, design templates, system design documentation—and you're not building custom Copilot agents yet, you're leaving massive leverage on the table.

Let me show you how to build your own.

Why Smart Engineers Keep Typing the Same Prompts

After reviewing 200+ architecture decisions across 20+ companies, I noticed a pattern: the same 7 core decisions, the same trade-off analysis, the same documentation structure. The content changed, but the process? Nearly identical every time.

Here's what repeatable expert work looks like:

  • Architecture Decision Records (ADRs) - same template, different decisions
  • Code review checklists - same quality criteria, different PRs
  • System design docs - same sections (context, requirements, alternatives, decision, consequences)
  • Onboarding documentation - same structure, different systems
  • Technical proposals - same format (problem, solution, trade-offs, implementation)

Each time feels unique because the specific technology or business context changes. But 80% of the process follows the same pattern.

Most engineers recognize this and try to use Copilot Chat to help. The problem? Every time you open Chat, you start from zero context. You type:

"I need to document an architecture decision. We're evaluating 
monolith vs microservices for a payment system. Consider these 
constraints: 5-person team, 2M transactions/month, EU data 
residency requirements, 18-month runway..."

Chat generates something. It's partially useful, but the structure isn't quite right. You refine the prompt. Generate again. Edit manually. Move to the next section. Type the context again with adjustments.

This is where copilot-instructions.md helps—but doesn't solve.

Your .github/copilot-instructions.md file provides passive context. Copilot knows your architecture patterns, your code conventions, your preferred structures. That's level 2 on the Agent Maturity Model.

But you're still manually driving the workflow. You still type each prompt. You still manage the workflow in your head. The leverage isn't in execution—it's in consistency.

What if instead of typing the same prompts repeatedly, you codified the entire workflow once? That's what .agent.md files enable.
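
To make that concrete before we go further, here's a stripped-down sketch of the shape of an agent file (illustrative only; the full template and my real ADR agent appear later in this article):

```markdown
# ADR Generator Agent

## Your Role
You are an architecture documentation specialist. Guide the user through
documenting an architecture decision with balanced trade-off analysis.

## Process
1. Gather context: system, decision to be made, constraints
2. Analyze each option: pros, cons, effort estimate
3. Recommend one option based on the stated constraints
4. Document consequences (positive AND negative)

## Quality Checks
- [ ] All options analyzed with the same depth
- [ ] Consequences include both positive and negative outcomes
```

Role, process, validation. That's the whole idea.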

The Hidden Cost of Not Systematizing

I calculated the real cost after tracking my time for a month. Here's what I found:

Architecture work: 8-12 hours/week on tasks that followed patterns

  • ADRs: 2 hours per decision × 2 decisions/week = 4 hours
  • System design docs: 3 hours per doc × 1 doc/week = 3 hours
  • Code review guidance: 1 hour per complex review × 3 reviews/week = 3 hours
  • Technical proposals: 2 hours per proposal × 2 proposals/month = 1 hour average/week
  • Architecture review prep: 2 hours per review × 1 review/week = 2 hours

Weekly total: 13 hours on repeatable expert tasks
Annual total: 13 hours × 48 weeks = 624 hours per year

That's 15.6 full 40-hour work weeks spent on tasks that followed patterns I could document.

But the time cost isn't the only problem:

Inconsistency costs:

  • Junior engineers got different architectural guidance depending on who they asked
  • ADR formats varied across the team (everyone invented their own structure)
  • Code reviews focused on different things (some on style, some on architecture, some on tests)
  • No institutional knowledge—expertise lived in people's heads, not in reusable systems

Knowledge transfer costs:

  • New hires took 3-4 months to learn "how we do things here"
  • Tribal knowledge stayed tribal (passed person-to-person, never documented systematically)
  • Bus factor was real—key people leaving meant expertise leaving

Leverage costs:

  • 1:1 leverage (I do the work, I get the output)
  • Can't scale expertise beyond direct mentoring
  • Every architecture decision requires my direct involvement

The paradox: I was senior enough to have valuable patterns worth reusing, but too busy executing those patterns to systematize them.

Why AI Makes This Harder, Not Easier

In 2026, here's what changed:

What AI can now do (execution):

  • Generate an ADR if you describe the decision
  • Draft a system design doc from a set of requirements
  • Create a code review checklist from architecture guidelines
  • Generate technical proposal outline from a problem statement

What AI still can't do (judgment):

  • Know which template structure works for YOUR team
  • Understand what level of detail YOUR stakeholders need
  • Recognize when generated content matches YOUR voice
  • Validate whether the output follows YOUR specific patterns

Most engineers I talk to have tried using ChatGPT or Copilot Chat for repeated tasks. The experience goes like this:

  1. Week 1: "Wow, this is magic! Generated my ADR in 10 minutes!"
  2. Week 2: "Hmm, I'm spending 30 minutes editing it to match our format..."
  3. Week 3: "This is taking longer than just writing it myself. And it doesn't sound like me."
  4. Week 4: Back to manual.

The problem isn't AI capability. It's systematization. You're treating AI like a search engine—query in, result out. But what you need is a personal automation system that encodes your expertise once and applies it consistently forever.

Junior engineers spam prompts and hope for useful output. Senior engineers build systems that encode expertise into reusable workflows.

The difference? Custom Copilot agents.

The Agent Maturity Model: From Ad-Hoc Prompts to Orchestrated Workflows

After building 4 production agents and talking to dozens of engineers using Copilot, I've seen a clear progression. Most people get stuck at Level 1 or 2 and don't realize Levels 3 and 4 exist.

Level 1: Ad-Hoc Prompting (Where Most People Are)

What it looks like:

  • Type prompts in Copilot Chat as needed
  • Retype context every time
  • No memory between sessions
  • Start from scratch each time

Example workflow:

  1. Open Chat
  2. Type: "Generate an architecture decision record for..."
  3. Get output
  4. Edit manually
  5. Move to next task
  6. Repeat from step 1

Leverage: 1:1 (each task requires new prompt, no reuse)
Time investment: 5 minutes to learn
Time savings: 10-20% (AI generates draft, you edit heavily)

Level 2: Copilot Instructions (Passive Context)

What it looks like:

  • .github/copilot-instructions.md provides context about your codebase, patterns, conventions
  • Copilot knows your preferences without you retyping them
  • But you still manually drive each task

Example workflow:

  1. Open Chat (Copilot already knows your architecture from instructions)
  2. Type: "ADR for database choice"
  3. Get better output (matches your format because context is loaded)
  4. Edit manually (still needed, but less)
  5. Move to next task

Leverage: 1:1.5 (better quality per prompt, but same time)
Time investment: 2-4 hours to write good instructions
Time savings: 30-40% (less editing, more consistent)

I wrote about this in detail in "Copilot Instructions Setup Guide". If you're not at Level 2 yet, start there. It's the foundation.
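
If you've never seen one, a .github/copilot-instructions.md file is just passive context in plain markdown. A minimal sketch (the conventions below are made up for illustration; use your own):

```markdown
# Copilot Instructions

## Architecture
- ADRs live in /docs/adr/ and follow Context → Options → Decision → Consequences
- Design docs follow Context → Requirements → Options → Decision → Consequences

## Conventions (illustrative examples, replace with your team's)
- Prefer small, composable services over shared utility modules
- Every proposal must name its trade-offs explicitly
```

Copilot reads this on every chat, so you stop re-explaining your world. But you're still the one driving each task.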

Level 3: Custom Agent Files (Active Workflows)

What it looks like:

  • .github/agents/[task-name].agent.md files encode complete workflows
  • Agent executes multi-step process automatically
  • You validate at decision gates, agent handles execution
  • Build once, use forever

Example workflow:

  1. Run: @adr-generator decision="database choice" options="PostgreSQL,MongoDB"
  2. Agent prompts for: context, requirements, constraints, evaluation criteria
  3. Agent generates: comparison matrix, trade-off analysis, recommendation, formatted ADR
  4. You validate: Does it match our decision? Is the analysis sound?
  5. Accept or refine

Leverage: 1:10 (build once, 10+ uses with minimal edits)
Time investment: 6-8 hours to build first agent
Time savings: 70-80% (2-hour task becomes 20 minutes)
ROI point: After 5-7 uses (typically week 2-3)

This is where the magic starts. You're not typing prompts anymore—you're running systems.

Level 4: Orchestrator Agents (Workflow Coordination)

What it looks like:

  • Meta-agent that coordinates multiple specialized sub-agents
  • Each sub-agent handles one phase
  • Orchestrator manages state, transitions, validation gates
  • Full automation of complex workflows

Example workflow (my production design doc system):

  1. Run: @design-doc-orchestrator feature="Payment system v2" with high-level description
  2. Orchestrator loads: architecture patterns, system constraints, tech stack, past ADRs
  3. Phase 1: requirements-analyzer generates structured requirements (quality score ≥40/50)
  4. Gate 1: You validate completeness → Accept or refine
  5. Phase 2: architecture-options generates 3-5 options with trade-offs (checklist ≥20/25)
  6. Gate 2: You validate options and analysis → Accept or refine
  7. Phase 3: design-doc-writer agent creates complete design doc + diagrams (technical accuracy check)
  8. Gate 3: You validate completeness and quality → Accept or revise
  9. Phase 4: implementation-plan agent creates phased rollout plan (feasibility check)
  10. Gate 4: You validate plan and risks → Accept or regenerate
  11. Orchestrator generates completion report with documentation links

Leverage: 1:8+ (complex workflows automated, 87% time saved)
Time investment: 2-3 weekends to build orchestrator + sub-agents (20-30 hours)
Time savings: 87% (4-hour task becomes 30 minutes)
ROI point: After 3-5 uses (typically week 1-2)

My design doc orchestrator is 1,000+ lines of instructions. Each sub-agent is 250-600 lines. Total system: 2,800+ lines.

Investment: 2 weekends upfront
Return: 87.5 hours saved over 6 months (and counting)

That's the maturity model. Most engineers stay at Level 1 because they don't know Levels 3 and 4 exist. Now you do.

The 3-Gate Validation Framework: How to Prevent Garbage AI Output

Here's what I learned the expensive way: automation without validation equals fast garbage.

My first agent had zero quality gates. Ran it. Generated a complete ADR. Beautifully formatted. Perfect markdown structure.

Then I read it. Generic AI content. Could've been ChatGPT. No specific trade-offs, no constraint analysis, no pros/cons depth. Would take 2+ hours to edit into something reviewable.

Cost: $180 in wasted time debugging why "automation" was slower than manual. Lesson learned: agents need quality gates.

Gate 1: Input Validation

Purpose: Verify user request is clear and complete before starting

Why it matters: Garbage in = garbage out. If the input is vague, the output will be generic no matter how good your agent is.

How to implement:

## Input Validation

Before proceeding, verify:
- [ ] Category is specified (required)
- [ ] User provided EITHER category OR personal experience (not both vague)
- [ ] If experience provided, it includes specific details (not "I noticed something")

If validation fails, ask clarifying questions:
- "Which category? (ai-developer-productivity, technical-leadership, etc.)"
- "Can you provide more specific details about the experience? (numbers, scenarios, outcomes)"

Real example from my orchestrator:

User input: "topic=agents"

❌ TOO VAGUE - Agent responds:
"I need more context. Are you thinking:
1. GitHub Copilot agents setup guide?  
2. Multi-agent systems for architecture?  
3. Agent-based testing patterns?  
Or provide a specific experience you'd like to write about."

Result: Forces clarity upfront. Better input → better output. Saves wasted generation time.

Gate 2: Process Validation

Purpose: Check intermediate outputs meet quality thresholds before proceeding

Why it matters: Catch problems early, before final output. If the topic briefing is weak, the article will be weak. Fix it at briefing stage, not article stage.

How to implement:

## Quality Score Thresholds

After generating topic briefing, score against:
- Relevance (1-10): Solves real pain point?  
- Uniqueness (1-10): Can Ruchit say something others can't?  
- Actionability (1-10): Readers can use Monday?  
- AI-resistance (1-10): Specifics/stories included?  
- EEAT potential (1-10): Experience signals planned?

Total score: X/50

✅ PASS (≥40): Proceed to user review
⚠️ REFINE (35-39): Suggest improvements
❌ FAIL (<35): Auto-refine with stronger angle

Real example from Phase 1 of my design doc orchestrator:

Requirements Input: "We need better caching"

Quality Score: 26/50 ❌ FAIL
- Specificity: 4/10 (too vague)
- Constraints: 2/10 (no performance targets)  
- Completeness: 5/10 (missing scale requirements)
- Measurability: 3/10 (no success criteria)  
- Context: 12/10 (system is clear)

Agent auto-refines with questions:
→ "Target response time? Cache hit ratio goal? Data consistency requirements?"

Quality Score: 48/50 ✅ PASS
- Specificity, constraints, completeness, and measurability all now above threshold
- Context unchanged (system was already clear)

Result: Never waste time completing weak outputs. Fix problems early when correction is cheap.

Gate 3: Output Validation

Purpose: Human validates final artifact before accepting

Why it matters: AI can't judge strategic fit, voice match, or subtle quality markers. You can.

How to implement:

## Final Quality Checklist

Present to user with:
- Word count: [target 3000-4000] ✅/❌  
- EEAT signals: [3+ experience references] ✅/❌  
- Numbers: [5+ specific metrics] ✅/❌  
- Personal anecdotes: [3+ stories] ✅/❌  
- Vulnerability: [1-2 mistakes shared] ✅/❌  
- Voice match: [Sounds like Ruchit?] ✅/❌  
- Files created: [All present?] ✅/❌

Options:
1. ✅ Accept → Complete
2. 🔄 Revise specific section → Edit + regenerate  
3. 🔙 Back to outline → Restructure

Real example from my article agent:

Article Generated: "Custom Copilot Agents"

Quality Check: 7/7 ✅ PASS
✅ Word count: 3,847 (target: 3000-4000)
✅ EEAT signals: 8 experience references  
✅ Numbers: 12 specific metrics  
✅ Anecdotes: 5 personal stories  
✅ Vulnerability: 2 failure stories ($180 waste, garbage output)  
✅ Voice match: Uses "you/we", specific, opinionated  
✅ Files: Markdown + metadata created

User: Accept → Proceed to Phase 4 (shorts generation)

Result: Only publish when quality standards are met. Maintain consistency across all outputs.


The gate pattern works because it mirrors code review:

  • Catch problems early (less expensive to fix)
  • Validate intermediate steps (don't compound errors)
  • Human judgment at decision points (AI executes, you validate)
  • Clear pass/fail criteria (no ambiguity about quality)

After I added these three gates to my orchestrator, I've never had to throw away agent output. First pass quality went from ~30% acceptable to ~85% acceptable. The 15% that needs revision? It's minor tweaks, not rewrites.

Gates aren't about not trusting AI. They're about not trusting ANY automation without verification. Same reason you don't push to prod without tests.

The Agent Design Canvas: When to Build an Agent vs When to Just Prompt

After building 4 agents, I've developed clear criteria for when building an agent is worth the time investment vs when you should just prompt manually.

Not every repeatable task deserves an agent. Some tasks are better handled with good copilot-instructions.md and manual prompts. Here's how to decide:

The Four Questions

Question 1: Is it repeatable?

Agent-worthy:

  • Same process, different inputs (ADR template, code review checklist)
  • You've done it 5+ times
  • The structure doesn't change, only the content

Not agent-worthy:

  • One-off exploration ("what are authentication options?")
  • Research without predictable structure
  • Rarely needed (quarterly or less)

Rule: Need meaningful repetition to justify build time. If you won't use it 5+ times, just prompt manually.


Question 2: Is it high-value?

Agent-worthy:

  • Takes 2+ hours when done manually
  • Requires expertise to do well
  • Affects architecture decisions or system design
  • Produces artifacts used by multiple people

Not agent-worthy:

  • Quick lookups or transformations (< 30 minutes)
  • Low-stakes outputs
  • Simple formatting changes

Rule: Automate the expensive stuff first. If it's quick and low-stakes, manual is fine.


Question 3: Is it multi-step?

Agent-worthy:

  • Multiple phases with decision points
  • Each phase has different requirements
  • Workflow coordination needed
  • Example: Research → Analysis → Documentation → Review

Not agent-worthy:

  • Single transformation ("convert this JSON to TypeScript types")
  • One-step generation
  • No intermediate decisions

Rule: Single-step tasks work great as copilot-instructions examples. Multi-step workflows need agent orchestration.


Question 4: Does it need consistency?

Agent-worthy:

  • Same output format across team
  • Same questions must be asked every time
  • Standard structure enforced (compliance, documentation)
  • Quality criteria must be consistent

Not agent-worthy:

  • Open-ended creative work
  • Exploration where variation is good
  • Brainstorming or ideation

Rule: Agents enforce consistency. If variation is a feature, not a bug, don't build an agent.
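
If it helps, here's the canvas as a quick checklist you can paste into a scratch doc before committing a weekend to a build (a sketch, not a formal scoring model):

```markdown
## Agent Design Canvas: [task name]

- [ ] Repeatable: same structure, done 5+ times, will do 5+ more
- [ ] High-value: 2+ hours per instance when done manually
- [ ] Multi-step: distinct phases with decision points in between
- [ ] Needs consistency: standard format and criteria matter to the team

All four checked → build the agent
Anything unchecked → good copilot-instructions.md plus manual prompts is enough
```

Four yeses, or just prompt manually.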

Agent-Worthy Examples (Real Tasks Worth Automating)

Architecture Decision Records (ADRs)

  • ✅ Repeatable: Same template, different decisions (weekly)
  • ✅ High-value: 2 hours per ADR, affects architecture
  • ✅ Multi-step: Context gathering → Options analysis → Trade-offs → Decision → Consequences
  • ✅ Needs consistency: Team adoption requires standard format

My implementation: ADR agent saves 1.75 hours per decision × 8 decisions/month = 14 hours/month.


Code Review Checklists

  • ✅ Repeatable: Same criteria every PR (daily)
  • ✅ High-value: Catches architecture violations before merge
  • ✅ Multi-step: Load guidelines → Scan violations → Generate comments → Suggest fixes
  • ✅ Needs consistency: Every PR should hit same quality bar

Use case: Agent scans each PR against a 12-point architecture checklist and comments on violations with severity levels and suggested fixes.


System Design Documentation (General Pattern)

  • ✅ Repeatable: Same sections (Context, Requirements, Alternatives, Decision, Trade-offs, Consequences)
  • ✅ High-value: 3 hours per doc, critical for alignment
  • ✅ Multi-step: Requirements gathering → Alternative analysis → Decision rationale → Documentation
  • ✅ Needs consistency: Stakeholders expect same structure

Pattern: Agent prompts for inputs in structured way, generates standardized doc.


System Design Documentation (My Real System)

  • ✅ Repeatable: Same section structure (Context, Requirements, Options, Decision, Consequences), different systems (2x per month)
  • ✅ High-value: 4 hours per doc manually
  • ✅ Multi-step: Requirements gathering → Options analysis → Trade-off evaluation → Documentation → Implementation planning
  • ✅ Needs consistency: Standard format, complete analysis, stakeholder-ready

Results: 4-agent orchestrator, 25+ docs generated, 87.5 hours saved in 6 months.

NOT Agent-Worthy Examples (Just Use Copilot Chat)

One-off Code Explanation

  • ❌ Not repeatable: Different code every time, no template
  • ✅ Low-value: 5 minutes to understand
  • ❌ Single-step: Just explanation
  • ❌ No consistency needed: Context-specific

Better approach: Select code → Ask Copilot → Get explanation. No agent needed.


Exploratory Research

  • ❌ Not repeatable: Different questions, different domains
  • ❌ Open-ended: No predictable structure
  • ❌ Single-step: Just information gathering
  • ❌ Variation is good: Want to explore different angles

Better approach: Use Copilot Chat for each unique research question. Building agent over-engineers.


Quick Refactoring

  • ✅ Could be repeatable: Same pattern (extract method, rename variables)
  • ❌ Low-value: 2-5 minutes per instance
  • ✅ Single-step: Just transformation
  • ❌ No consistency needed: Code-specific context

Better approach: Use Copilot's inline suggestions or Chat with /fix. Fast enough manually.


Debugging Help

  • ❌ Not repeatable: Every bug is unique
  • ✅ High-value: Could save hours
  • ❌ Too context-specific: Depends on stack trace, code state, environment
  • ❌ No consistency: Process varies wildly by bug type

Better approach: Share error + context in Copilot Chat. Debugging is detective work, not workflow.

The Swiss Watchmaking Parallel

Think about master watchmakers. They don't hand-make every component for every watch. That would be impossibly slow.

Instead, they create jigs and templates—precision tools that guide the creation of components. A jig ensures that an apprentice can produce a balance wheel that meets the same tolerances as the master's hand-made version.

The jig is the expertise, encoded.

Your custom agent is the same thing. You're the master craftsman. The agent is your jig. Junior engineers (or future you) can now produce architecture docs, ADRs, code reviews, or design proposals that match the quality and structure you would produce—without needing your direct involvement in every execution.

The expertise isn't in doing the work repeatedly. It's in creating the system that does the work correctly, every time.

After building 4 agents, the pattern is clear: If I've done it 5+ times and can describe the process in detail, it's agent-worthy. If I'm still figuring out the process, it's too early to automate.

Orchestrator Pattern: When One Agent Isn't Enough

Some workflows are too complex for a single agent. They require multiple specialized agents coordinated by an orchestrator.

Here's when you need orchestration:

  • Multiple distinct phases with different expertise required
  • Decision points between phases where user validates before proceeding
  • Sub-tasks that are independently useful (might run standalone sometimes)
  • Different context requirements per phase (don't want one agent loading everything)

My production example is the best way to explain this:

My System Design Doc Orchestrator: A Real System

The Problem: Creating a system design doc manually took me 3-4 hours over 2 days:

  • Day 1 (2 hours): Requirements gathering, stakeholder interviews, constraint analysis, architecture options research
  • Day 2 (1.5 hours): Design documentation, diagram creation, trade-off analysis
  • Day 2 (30 mins): Implementation planning, risk assessment, stakeholder review prep

I tried using a single agent initially. The agent file was 1,800+ lines. Hard to debug, hard to update, and it tried to do too much at once—requirements, options, design doc, and implementation plan all in one flow. Validation was all-or-nothing.

The Solution: Orchestrator + 4 Specialized Sub-Agents

Sub-Agent 1: Requirements Analyzer Agent (Phase 1)

Role: Requirements engineering specialist analyzing stakeholder needs

Input: Feature description OR stakeholder requirements document
Context loaded: System constraints, tech stack, compliance requirements, past ADRs
Process: Gather functional requirements → Identify constraints → Define success criteria → Generate structured requirements
Output: Requirements document with quality score
Validation Gate: Score ≥40/50 or auto-refine

What it generates:

# Requirements Analysis

Feature: "Payment System v2"
Scope: Cross-border payment processing with EU compliance
Functional Requirements: [5-8 key requirements]
Non-Functional Requirements: [Performance, security, compliance]
Constraints: [Budget, timeline, team, tech stack]
Success Criteria: [Measurable outcomes]

Quality Score: 46/50 ✅

Agent size: ~400 lines


Sub-Agent 2: Architecture Options Agent (Phase 2)

Role: Architecture strategist generating and analyzing design alternatives

Input: Approved requirements document
Context loaded: Architecture patterns, system design principles, tech stack capabilities
Process: Generate architecture options (3-5) → Analyze trade-offs → Estimate effort → Recommend solution
Output: Architecture options document with detailed trade-offs
Validation Gate: Checklist ≥20/25 items complete (all options analyzed equally)

What it generates:

# Architecture Options Analysis

## Option 1: Microservices with Event Sourcing
**Pros:** [5-6 benefits]
**Cons:** [5-6 drawbacks]
**Effort:** High (6-8 months)
**Best for:** [Scenario]

## Option 2: Modular Monolith
**Pros:** [5-6 benefits]
**Cons:** [5-6 drawbacks]
**Effort:** Medium (3-4 months)
**Best for:** [Scenario]

[2-3 more options...]

**Recommendation:** [Option X] based on constraints [Y, Z]

Checklist: 23/25 ✅

Sub-Agent 3: Design Doc Writer Agent (Phase 3)

Role: Technical documentation specialist creating stakeholder-ready design documents

Input: Chosen architecture option from options analysis
Context loaded: System architecture patterns, documentation templates, stakeholder requirements, compliance standards
Process: Document decision context → Detail architecture components → Generate diagrams → Document implementation considerations
Output: Complete design document (2,000-3,000 words) with architecture diagrams
Validation Gate: 6-point quality check (completeness, technical accuracy, diagram clarity, stakeholder readiness)

What it generates:

  1. /docs/design/{feature-name}-{date}.md - Complete design doc
  2. Architecture diagrams (component, sequence, deployment)
  3. Technical trade-offs and constraints documentation

Agent size: ~500 lines


Sub-Agent 4: Implementation Plan Agent (Phase 4)

Role: Project planning specialist creating phased rollout strategies

Input: Approved design document
Process: Identify implementation phases → Define milestones → Assess risks → Create success criteria
Output: Phased implementation plan with timelines and risk mitigation
Validation Gate: Feasibility, team capacity considerations, risk assessment completeness

What it generates: Implementation plan document with:

  • Phase breakdown (POC → MVP → Full rollout)
  • Success criteria per phase
  • Risk mitigation strategies
  • Resource requirements

Agent size: ~350 lines


The Orchestrator Agent

Role: Workflow coordinator managing 4-phase design doc generation

Responsibilities:

  1. Parse user input (feature description OR requirements document)
  2. Load all context files at workflow start (architecture patterns, system constraints, tech stack, past ADRs)
  3. Execute phases sequentially, passing outputs between agents
  4. Present validation gates to user at 4 decision points
  5. Handle refinement loops (user can reject and iterate any phase)
  6. Track progress (show current phase, time elapsed, phases remaining)
  7. Error recovery (if agent fails, offer retry or manual intervention)
  8. Generate completion report (files created, documentation links, quality metrics, next steps)

The Workflow:

User: @design-doc-orchestrator feature="Payment system v2" requirements="2M transactions/month, EU compliance"

[Loading context files...]
✅ Architecture patterns, tech stack, past ADRs loaded
✅ System constraints analyzed

[Phase 1: Requirements Analysis - 8 mins]
→ Run requirements-analyzer agent
→ Generate structured requirements
→ Quality score: 46/50 ✅

Gate 1: User validates completeness → Accept

[Phase 2: Architecture Options - 10 mins]
→ Run architecture-options agent with approved requirements
→ Generate 4 architecture options with trade-offs
→ Checklist: 23/25 ✅

Gate 2: User validates options and analysis → Accept

[Phase 3: Design Doc Writing - 10 mins]
→ Run design-doc-writer agent with chosen architecture
→ Generate 2,400-word design doc with diagrams
→ Create markdown file + architecture diagrams
→ Quality: 6/6 checks passed ✅

Gate 3: User validates technical accuracy → Accept

[Phase 4: Implementation Planning - 6 mins]
→ Run implementation-plan agent on approved design
→ Generate phased rollout plan with milestones
→ Quality: 5/5 checks passed ✅

Gate 4: User validates feasibility → Accept

[Workflow Complete! Total: 32 mins]

Documentation:
- Design doc: /docs/design/2026-02-payment-system-design.md
- Implementation plan: Added to project tracking

Next: Share with stakeholders for review

Orchestrator size: 1,000+ lines

Total system: 2,800+ lines of instructions across 5 files
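
For a sense of the wiring, here's the orchestrator's phase plan trimmed to a skeleton (the real file is far longer; this sketch only keeps the structure described above):

```markdown
# Design Doc Orchestrator

## Setup
- Load context once at start: architecture patterns, system constraints, tech stack, past ADRs

## Phases
1. requirements-analyzer  → Gate 1: quality score ≥ 40/50, user accepts or refines
2. architecture-options   → Gate 2: checklist ≥ 20/25, user accepts or refines
3. design-doc-writer      → Gate 3: 6-point quality check, user accepts or revises
4. implementation-plan    → Gate 4: feasibility check, user accepts or regenerates

## Rules
- Output of phase N is the input to phase N+1
- On rejection, refine the current phase only; never skip a gate
- Finish with a completion report: files created, quality metrics, next steps
```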


Why This Architecture Works

Separation of concerns:

  • Each sub-agent is expert in one phase
  • Easier to debug (problem in the options analysis? Fix architecture-options, not the entire system)
  • Can run sub-agents standalone if needed (just need requirements scoped? Run requirements-analyzer)

Validation at decision points:

  • Don't compound errors—catch problems early
  • User has control at 4 strategic gates
  • Can iterate on any phase without restarting entire workflow

Incremental delivery:

  • Phase 1 output (requirements analysis) is valuable even if you stop there
  • Phase 2 output (architecture options) is a publishable artifact by itself
  • Can pause between phases, resume later

Reusable components:

  • Architecture-options agent can be used for technical spikes and prototyping decisions
  • Implementation-plan agent works on any design doc (not just orchestrator-generated)

Manageable complexity:

  • Each agent file is 250-600 lines (readable, maintainable)
  • Orchestrator handles coordination, agents handle execution
  • Clear interfaces between agents (output of Phase N = input of Phase N+1)

Real Metrics After 6 Months

Design docs generated: 25+
Total documentation produced: 60,000+ words
Time investment:

  • Building orchestrator + agents: 2 weekends (~20 hours)
  • Updating/improving agents: ~10 hours total over 6 months
  • Running workflows: 25 docs × 30 mins = 12.5 hours

Time saved:

  • Manual approach: 25 docs × 4 hours = 100 hours
  • Orchestrator approach: 12.5 hours
  • Net savings: 87.5 hours (and counting)

Quality outcomes:

  • Format consistency: 100% (same structure every time)
  • Analysis completeness: 95%+ (the occasional trade-off needs expansion)
  • Structural consistency: 100% (all standard sections present)
  • First-pass acceptability: 85% (15% need minor technical clarification)

Non-time benefits:

  • Junior engineers can now run requirements analyzer for feature scoping
  • Architecture options agent used for technical spikes and prototyping decisions
  • Implementation planner has been repurposed for retrospective action planning
  • System is documented—others can understand and improve it

When You Need Orchestration vs Single Agent

Use orchestrator when:

  • 3+ distinct phases with different expertise
  • Phases can be validated independently
  • Sub-tasks are reusable in other contexts
  • Workflow takes 6+ hours manually

Use single agent when:

  • 1-2 phases, linear workflow
  • All validation happens at end
  • Task-specific (won't reuse components)
  • Workflow takes 1-3 hours manually

The inflection point: If your single agent file exceeds 1,000 lines or you find yourself debugging "which part broke?", it's time to split into orchestrator + sub-agents.
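
In my case that split looked roughly like this (file names are illustrative and follow the kebab-case convention used throughout; the line counts are the ones quoted above):

```
Before: design-doc.agent.md                    ~1,800 lines, all four phases in one file

After:  design-doc-orchestrator.agent.md       ~1,000 lines, coordination + gates
        requirements-analyzer.agent.md         ~400 lines
        architecture-options.agent.md          (sub-agents run 250-600 lines)
        design-doc-writer.agent.md             ~500 lines
        implementation-plan.agent.md           ~350 lines
```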

Building Your First Agent: A Step-by-Step Guide

Let's make this concrete. By end of this section, you'll know exactly how to build your first .agent.md file.

Step 1: Identify Your Agent-Worthy Task (15-30 minutes)

Your assignment: Find one repeatable task you do weekly that takes 2+ hours.

Questions to ask yourself:

  • What do I do repeatedly that follows the same pattern?
  • What takes me 2+ hours each time?
  • What am I tired of doing manually?
  • What would I want a junior engineer to be able to produce to my quality standards?

Common examples for senior engineers:

  • Architecture Decision Records (ADRs)
  • System design documentation
  • Technical proposals (RFC format)
  • Code review checklists
  • Onboarding runbooks
  • Incident post-mortems
  • API design documents

My example: I chose system design documentation because I was doing it twice per month, spending 4 hours each time, and the structure was consistent (Context → Requirements → Options → Decision → Consequences, same analysis framework, stakeholder-ready format).

Pick ONE to start. Don't try to automate everything at once.

Step 2: Document Your Manual Process (30-45 minutes)

Before you write any agent code, document exactly what you do manually.

Open a blank document and write:

Process: [Task Name]

Inputs I need:

  • [What information/files do I need to start?]
  • [What decisions do I make before beginning?]

Steps I follow:

  1. [First thing I do]
  2. [Second thing I do]
  3. [Decision point: what determines which path I take?]
  4. [Fourth thing I do] ...

Validation checks:

  • [How do I know if the output is good?]
  • [What are the quality criteria?]
  • [What would make me reject it and start over?]

Final output:

  • [What artifact do I produce?]
  • [What format does it have?]
  • [Where does it get stored?]

Example from my ADR agent:

Process: Architecture Decision Record

Inputs I need:
- Context: What system? What decision are we making?
- Options: What alternatives are we considering? (2-4 options)
- Constraints: Budget? Timeline? Team size? Tech stack?

Steps I follow:
1. Document context and background
2. List decision options with brief description
3. For each option, analyze:
   - Pros (benefits, strengths)
   - Cons (costs, weaknesses, risks)
   - Effort estimate (low/medium/high)
4. Make recommendation based on constraints
5. Document consequences (what changes if we do this)
6. Format as markdown following template

Validation checks:
- All options analyzed equally (not biased)
- Trade-offs are honest (not just pro-my-favorite-option)
- Consequences include both positive and negative
- Decision is justified with reasoning
- Format matches team template

Final output:
- Markdown file: `/docs/adr/YYYY-MM-DD-decision-title.md`
- Following template: Context → Options → Analysis → Decision → Consequences

This documentation becomes the foundation of your agent.

Step 3: Create Your .agent.md File (5 minutes)

In your repo, create the file:

mkdir -p .github/agents
touch .github/agents/adr-generator.agent.md

Naming convention:

  • Use kebab-case: adr-generator.agent.md
  • Be specific: code-review-checklist.agent.md not review.agent.md
  • Make it discoverable: teammates should understand what it does from filename
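
For orientation, the files involved end up laid out roughly like this (the docs path matches the ADR example later in this article):

```
.github/
  copilot-instructions.md       # passive context (Level 2)
  agents/
    adr-generator.agent.md      # the workflow you're about to encode (Level 3)
docs/
  adr/                          # where the generated ADRs land
```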

Step 4: Write Agent Instructions (2-4 hours)

This is where you encode your expertise. Use this structure:

# [Agent Name] Agent

## Your Role

You are a [role description] working for [context about who is using this]. 
Your job is to [what the agent accomplishes].

**Critical:** [Any important constraints or guidelines]

---

## Input You Will Receive

[Describe what the user provides when invoking this agent]

**Required:**
- [Input 1]
- [Input 2]

**Optional:**
- [Input 3]

---

## Context Documents (if applicable)

### [File Name]
[What this file provides and why it matters]

### [File Name]
[What this file provides and why it matters]

---

## Process

[Step-by-step workflow the agent follows]

### Step 1: [Phase Name]

**Your Actions:**
1. [Action 1]
2. [Action 2]
3. [Action 3]

**Validation:**
[How to check if this step succeeded]

### Step 2: [Phase Name]

[Repeat structure]

---

## Output Format

[Exact structure of what the agent produces]

**File:** `[path/to/output/file]`

[Template or example of output]

---

## Quality Checks

Before finalizing, verify:

- [ ] [Check 1]
- [ ] [Check 2]
- [ ] [Check 3]
- [ ] [Check 4]
- [ ] [Check 5]

---

## User Decision Point

Present output to user with options:
1. ✅ **Accept** → Complete workflow
2. 🔄 **Revise** → [What can be revised]
3. 🔙 **Cancel** → End workflow

Real example - ADR Generator Agent (simplified):

# ADR Generator Agent

## Your Role

You are an architecture documentation specialist helping engineering teams create consistent Architecture Decision Records (ADRs). Your job is to guide users through documenting architectural decisions with balanced analysis and clear trade-offs.

**Critical:** ADRs must be neutral in analysis. Don't bias toward any option—present trade-offs honestly so reviewers can make informed decisions.

---

## Input You Will Receive

**Required:**
- Decision context: What system? What decision are we making?
- Options being considered: 2-4 alternatives (e.g., PostgreSQL vs MongoDB)

**Optional:**
- Constraints: Budget, timeline, team skills, compliance requirements
- Prior decisions: Previous ADRs that provide context

---

## Process

### Step 1: Gather Context

**Your Actions:**
1. Ask user to describe the system and current state
2. Ask what decision needs to be made and why now
3. If not provided, ask about constraints (budget, timeline, team size)

**Validation:**
- Context is specific (not "we need a database" but "payment system handling 2M transactions/month, 5-person team, EU data residency required")

### Step 2: Analyze Options

For each option provided:

1. **Describe option** (2-3 sentences: what it is, how it works)
2. **List pros** (3-5 benefits: performance, cost, simplicity, etc.)
3. **List cons** (3-5 drawbacks: complexity, cost, learning curve, risks)
4. **Estimate effort** (Low/Medium/High with justification)

**Validation:**
- Each option analyzed equally (same depth)
- Trade-offs are honest (show both sides)
- Technical accuracy (don't make up capabilities)

### Step 3: Make Recommendation

Based on stated constraints:
1. Recommend one option with clear reasoning
2. Explain why it fits constraints better than alternatives
3. Acknowledge what you're trading away

**Validation:**
- Recommendation logically follows from analysis
- Reasoning connects to user's specific constraints
- Acknowledges trade-offs (no silver bullets)

### Step 4: Document Consequences

**Positive consequences:**
- What becomes easier/better with this decision?

**Negative consequences:**
- What becomes harder? What are we giving up?

**Validation:**
- Includes both positive AND negative
- Realistic about challenges ahead

---

## Output Format

**File:** `/docs/adr/YYYY-MM-DD-[decision-title].md`

```markdown
# ADR-XXX: [Decision Title]

Date: YYYY-MM-DD
Status: Proposed | Accepted | Deprecated

## Context

[System description and why decision is needed]

## Decision Options

### Option 1: [Name]

**Description:** [What it is]

**Pros:**
- [Benefit 1]
- [Benefit 2]
- [Benefit 3]

**Cons:**
- [Drawback 1]
- [Drawback 2]
- [Drawback 3]

**Effort:** [Low/Medium/High - justification]

### Option 2: [Name]

[Same structure]

## Decision

We will [chosen option] because [reasoning tied to constraints].

## Consequences

**Positive:**
- [What improves]

**Negative:**
- [What becomes harder]
- [What we're trading away]
```

## Quality Checks

Before finalizing, verify:

- [ ] All options analyzed with same depth
- [ ] Pros and cons are balanced (not biased)
- [ ] Decision recommendation justifies choice
- [ ] Consequences include both positive AND negative
- [ ] Format follows template exactly
- [ ] Technical details are accurate

## User Decision Point

Present ADR to user with options:

1. ✅ **Accept** → Save to `/docs/adr/`
2. 🔄 **Revise** → Adjust analysis, add/remove options, change recommendation
3. 🔙 **Cancel** → Discard

**That's your first agent.** Time to write agent instructions: 2-4 hours. Seems like a lot? ROI hits after 5-7 uses (week 2-3).

### Step 5: Test With Real Scenario (30-60 minutes)

Don't test with sample data. Use a real task you need to do this week.

**In VS Code:**
1. Open Copilot Chat
2. Type: `@adr-generator`
3. VS Code recognizes the agent
4. Provide inputs when prompted
5. Review output against your quality checklist
6. Note what needs improvement

**First test will reveal:**
- Missing validation steps
- Unclear instructions
- Output format issues
- Edge cases you didn't consider

**Expected outcome:** 70-80% good, 20-30% needs tweaking. That's normal.

### Step 6: Iterate on Voice and Quality (2-3 iterations)

Your first agent output will feel slightly generic. That's because you haven't encoded enough specificity yet.

**Common issues:**
- Too verbose (AI loves to over-explain)
- Generic phrasing ("it's important to..." "best practices suggest...")
- Missing your specific patterns or preferences
- Doesn't match your team's conventions

**How to fix:**
1. Add specific examples of good output
2. Add anti-patterns ("Don't say X, instead say Y")
3. Reference style guides or existing docs
4. Add constraints ("Max 2 paragraphs per section")

**Example iterations on my design doc writer agent:**

**Iteration 1:** Output was technically correct but too abstract

```
← Problem: "The system will leverage a distributed architecture pattern..."
→ Fix: Added anti-pattern: "Never use 'leverage' or 'utilize'. Be specific about components."
Result: "The system uses 3 microservices (auth, payment, notification)..."
```

**Iteration 2:** Output lacked specific trade-offs

```
← Problem: "This approach has benefits and drawbacks..."
→ Fix: Added requirement: "Always quantify trade-offs. State specific costs and benefits."
Result: "This adds 2 weeks dev time but reduces latency from 800ms to 120ms..."
```

**Iteration 3:** Output didn't match team documentation standards

```
← Problem: Missing consequences section
→ Fix: Added template requirement: "Always include a Consequences section with positive AND negative outcomes"
Result: Design docs now consistently include "What becomes easier" and "What becomes harder" sections
```

After 3-4 iterations, first-pass quality went from 60% acceptable to 85% acceptable.
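
Fixes like these typically end up as a short constraints block inside the agent file; roughly something like this sketch (wording condensed):

```markdown
## Style Constraints

- Never use "leverage" or "utilize"; name the actual components
- Always quantify trade-offs (dev time, cost, latency), never "has benefits and drawbacks"
- Always include a Consequences section with positive AND negative outcomes
- Max 2 paragraphs per section
```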

Step 7: Measure Time Savings (ongoing)

Track your metrics:

  • Time before agent: [X hours]
  • Time after agent: [Y minutes]
  • Time saved per use: [X - Y]
  • Uses so far: [count]
  • Total time saved: [count × savings]
  • ROI hit after: [build time / time saved per use = N uses]

My ADR agent metrics:

  • Before: 2 hours per ADR
  • After: 15 minutes per ADR (87.5% faster)
  • Time saved per use: 1.75 hours
  • Build time: 6 hours
  • ROI after: 6 hours / 1.75 hours = 3.4 uses (hit in week 1)
  • After 6 months: 24 ADRs generated, saved 42 hours

Your agent doesn't need to be perfect. It needs to be better than manual. 85% good with 20% of the time is a 4x improvement.

What to Do Monday Morning

You've read 3,800 words about custom Copilot agents. Theory is useless without action. Here's your concrete next step:

This week, do these three things:

1. Identify Your High-Value Repeatable Task (Monday, 15 minutes)

During your morning standup or daily planning:

  • Look at your calendar for this week
  • Identify the task that:
    • You do every week (or multiple times per month)
    • Takes 2+ hours
    • Follows a predictable pattern
    • You're tired of doing manually

Write it down:

Task: [name]
Frequency: [X times per week/month]
Time per instance: [Y hours]
Annual time cost: [X × Y × 52 weeks (or × 12 months) = Z hours]

Common examples for your role:

  • Senior Engineer: Code review checklist, technical design docs, onboarding guides
  • Tech Lead: Architecture decision records, sprint planning templates, tech debt prioritization
  • Architect: System design documentation, trade-off analysis, vendor evaluation frameworks
  • EM: 1-on-1 templates, performance review structures, team health check frameworks

My example:

Task: System design documentation
Frequency: 2x per month
Time per instance: 4 hours
Annual time cost: 2 × 4 × 12 months = 96 hours (2.4 weeks of full-time work)

If you can't identify one task, you're either not doing enough repeatable work (unlikely for senior roles) or you're not paying attention to what drains your time.

2. Document the Manual Process (Tuesday/Wednesday, 30-60 minutes)

Don't start building yet. Spend 30-60 minutes documenting exactly what you do manually.

Create a document:

# [Task Name] - Manual Process

## Inputs Required
- [What I need before starting]

## Steps
1. [First thing I do]
2. [Second thing I do]
...

## Decision Points
- [Where do I have to choose between options?]
- [What criteria do I use to decide?]

## Quality Checks
- [How do I know if it's good?]

## Output
- [What artifact do I produce?]
- [What format?]
- [Where does it go?]

Why this matters: Most people skip this step and jump straight to building the agent. Then they realize mid-build that they don't actually understand their own process. The agent will be as clear as your process documentation.

Time investment: 30-60 minutes upfront saves 2-3 hours of reworking agent instructions later.

3. Create Your First Agent File (Weekend, 3-4 hours)

Saturday or Sunday, block 3-4 hours:

  1. Create file (5 minutes):

    mkdir -p .github/agents
    touch .github/agents/[your-task].agent.md
    
  2. Write instructions following the template from Step 4 (2-3 hours):

    • Your Role
    • Input You Will Receive
    • Process (step-by-step)
    • Output Format
    • Quality Checks
  3. Test with real scenario (30-60 minutes):

    • Open VS Code Copilot Chat
    • Run @[your-agent-name]
    • Complete one full execution
    • Note what works, what needs improvement
  4. Iterate once (30-60 minutes):

    • Fix the obvious issues
    • Add one validation gate
    • Test again

Expected outcome after first weekend:

  • One working agent (70-80% quality)
  • One real task completed using the agent
  • Clear understanding of what needs improvement
  • Proof that this works

Don't expect perfection. First version will be rough. That's normal. You'll improve it over the next 2-3 uses.


Timeline:

  • Monday: 15 minutes to identify task
  • Tuesday/Wednesday: 30-60 minutes to document process
  • Weekend: 3-4 hours to build and test agent

Total investment: ~4-5 hours over one week

ROI: After 3-5 uses of your agent (typically 2-3 weeks)


If you do nothing else from this article, do those three things.

Building your first agent is like learning to ride a bike. Reading about it doesn't help. You have to actually get on the bike, wobble a bit, maybe fall once, then suddenly you're riding.

One week from now, you'll have your first custom agent. Two weeks from now, you'll wonder how you ever worked without it.

Key Takeaways

Most engineers treat Copilot like an enhanced autocomplete and miss the real leverage: encoding expertise into repeatable workflows. Here's what matters:

  • The Agent Maturity Model is your roadmap. Level 1 (ad-hoc prompts) → Level 2 (copilot-instructions for context) → Level 3 (custom agents for workflows) → Level 4 (orchestrators for complex coordination). Most people stay at Level 1 because they don't know Levels 3 and 4 exist. Each level gives you 5-10x leverage increase. The jump from Level 2 to 3 is where the magic starts—you stop typing prompts and start running systems. My ADR agent: 2 hours → 15 minutes (88% faster).

  • Validation gates are non-negotiable. Your first agent without quality checks will produce beautifully formatted garbage. I learned this the expensive way: $180 in API calls and hours of editing. Input validation (clear requirements), process validation (quality thresholds at intermediate steps), output validation (human reviews final artifact). Gates catch problems early when fixes are cheap. Same reason you don't push to prod without tests.

  • Not every task deserves an agent. Build only for repeatable, high-value, multi-step work that needs consistency. Four questions: Is it repeatable (5+ uses)? Is it high-value (2+ hours)? Is it multi-step (phases with decision points)? Does it need consistency (standard format)? If yes to all four, build the agent. If no to any, just use Copilot Chat manually. One-off exploration, quick refactoring, debugging—not agent-worthy.

  • Orchestrators unlock complex workflows. When you have 3+ distinct phases with different expertise required, build an orchestrator that coordinates specialized sub-agents. My design doc system: 4 agents (requirements analysis, architecture options, design doc writing, implementation planning) coordinated by orchestrator. Each agent is 250-600 lines, reusable independently. Total system: 2,800+ lines. ROI: 4 hours per doc → 30 minutes. 25+ docs generated. 87.5 hours saved in 6 months.

  • ROI hits faster than you think. First agent takes 6-8 hours to build (one afternoon or weekend). ROI after 5-7 uses (typically week 2-3). My ADR agent: 6-hour build, 1.75 hours saved per use, ROI after 3.4 uses (less than 2 weeks). After 6 months: 42 hours saved, still using it weekly. The time investment is front-loaded, the returns compound forever. Your expertise, encoded once, scales infinitely.

Your Next Step

Identify your most time-consuming repeatable task this week. Not someday. This week. The one that takes 2+ hours, follows a pattern, and makes you think "I've done this 20 times, why isn't there a template?"

Open a blank document right now. Write down:

  1. The task name
  2. How often you do it (weekly? bi-weekly? monthly?)
  3. How long it takes each time
  4. The annual time cost (frequency × duration, annualized)

That number—the annual time cost—is what you're about to get back. My calculation: 96 hours per year on system design docs (2× per month × 4 hours each × 12 months). Building the orchestrator took 20 hours. After 5 uses, I hit ROI. After 25 uses, I've saved 87.5 hours (and counting). That's 2+ weeks of 40-hour work weeks I got back.

By Friday, you'll have the foundation for your first agent. By next weekend, you'll have it built. By next month, you'll wonder how you ever worked without it.

Remember: AI can execute workflows, generate documentation, even suggest improvements. But deciding which workflows are worth systematizing, what quality bars to enforce, and whether the output meets your technical standards? That's still on you.

Your expertise isn't in doing the work repeatedly. It's in creating the system that does the work correctly, every time. You make the call.

Topics

github-copilot, copilot-agents, workflow-automation, vscode-agents, developer-productivity, ai-automation, copilot-orchestrator, custom-agents, agent-md-files, 2026

About Ruchit Suthar

15+ years scaling teams from startup to enterprise. 1,000+ technical interviews, 25+ engineers led. Real patterns, zero theory.