Designing On-Call Schedules That Don't Burn Out Your Engineers
If your on-call design creates zombie engineers who are always tired and on edge, your system is unhealthy—even if uptime looks good. Learn how to design on-call rotations, alerts, and culture that keep systems reliable and engineers sustainable.

TL;DR
Healthy on-call has predictable rotations, high-signal alerts, clear runbooks, and respected recovery time. Distribute load fairly, treat false positives as bugs, and make incidents trend downward over time. Reliability and sustainability aren't opposing goals—they're the same problem.
The Zombie Engineer Story
Meet Sarah. She's a senior backend engineer on your team. Technically brilliant. Owns the payments system. Everyone respects her work.
She's also exhausted.
Last week, she got paged four times between midnight and 5 AM. Not for critical outages—for noisy alerts that could have waited until morning. Saturday afternoon, another page during her daughter's birthday party. Sunday evening, two more false positives that reset her sleep schedule before Monday morning standup.
She's been on-call for six months straight because the team is understaffed and "she knows the system best." She doesn't complain because she's professional, but you've noticed she's slower to respond to messages, taking more sick days, and her last 1:1 felt distant.
Sarah isn't struggling because she's weak. She's struggling because your on-call system is broken.
Here's the uncomfortable truth: if your on-call design creates zombie engineers who are always tired, always on edge, and slowly burning out, your system is unhealthy—even if your uptime dashboard looks great.
Reliability and sustainability aren't opposing goals. They're the same problem. Let's fix both.
What a Healthy On-Call System Looks Like
Before we dive into mechanics, let's paint a picture of what good looks like.
In a healthy on-call system:
Rotations are predictable and fair
Engineers know when they're on-call weeks in advance. The load is distributed evenly. No one carries the pager for months because "they're the only one who knows the system."
Alerts have high signal, low noise
When the pager goes off, it's always for something that requires immediate human action. False positives are rare and treated as bugs to be fixed.
Responsibilities are documented
Every on-call engineer has clear runbooks. They know what to do, who to escalate to, and how to verify the fix worked.
Recovery time is respected
If you get paged at 3 AM, you're not expected in standup at 9 AM. After a rough on-call week, you get recovery time before the next rotation.
Incidents trend downward over time
The team treats every incident as a learning opportunity. Systems get more resilient. Alerts get smarter. On-call gets easier, not harder.
Engineers volunteer for on-call
This sounds impossible, but I've seen it. When on-call is well-designed, senior engineers willingly participate because they trust the system won't destroy their lives.
This isn't utopian thinking. This is achievable with deliberate design.
Designing the Rotation
Let's start with the foundation: how you structure the rotation itself.
Key Variables to Consider
Team size:
- Minimum 4–5 engineers for sustainable rotation
- Smaller teams need secondary coverage or shared rotation across teams
- If you can't staff this, you have a hiring problem, not an on-call problem
Time zones:
- Single timezone: simpler rotation
- Distributed team: follow-the-sun can reduce sleep disruption
- Don't expect one person to cover 24/7 across all timezones
Product criticality:
- E-commerce during Black Friday: needs robust coverage
- Internal tools: can tolerate longer response times
- Adjust rotation intensity to actual business impact
Rotation Patterns That Work
Primary + Secondary Model
The most common and effective pattern:
- Primary on-call: First responder, handles all alerts
- Secondary on-call: Backup if primary doesn't respond in 10–15 minutes, or escalation for complex issues
This provides:
- Safety net when primary is unavailable or overwhelmed
- Learning opportunity for less experienced engineers (shadow as secondary first)
- Clear escalation path
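To make the primary-to-secondary handoff concrete, here is a minimal sketch in Python, assuming a generic page() hook rather than any particular paging tool's API:

import time

def page(person: str, alert: str) -> None:
    # Placeholder notifier; wire this to your paging tool of choice.
    print(f"Paging {person}: {alert}")

def acknowledged(person: str) -> bool:
    # Placeholder; in practice, poll your paging tool for an ACK.
    return False

def escalate(alert: str, primary: str, secondary: str, ack_timeout_min: int = 15) -> None:
    """Page the primary; if there is no ACK within the window, page the secondary."""
    page(primary, alert)
    deadline = time.time() + ack_timeout_min * 60
    while time.time() < deadline:
        if acknowledged(primary):
            return  # Primary has it; no escalation needed.
        time.sleep(30)  # Re-check every 30 seconds.
    page(secondary, alert)  # No ACK within the window; bring in the backup.

Tools like PagerDuty and Opsgenie implement this escalation natively; the point is that the timeout and the fallback are explicit configuration, not tribal knowledge.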
Rotation Length: 1 Week Sweet Spot
After years of experimentation, I've found that one-week rotations work best for most teams:
Why 1 week:
- Long enough to minimize handover overhead
- Short enough that a rough week doesn't destroy morale
- Predictable end date keeps stress manageable
Why not longer:
- Two-week rotations amplify cumulative fatigue
- Engineers start dreading the rotation weeks in advance
- Harder to maintain coverage during holidays and PTO
Why not shorter:
- Daily rotations create too much handover overhead
- Weekend-only rotations feel punitive
- Context switching is expensive
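To show how little machinery a predictable, fair schedule needs, here is a minimal sketch of a one-week primary/secondary round-robin; the names and start date are placeholders, and a real schedule still has to account for PTO and holidays:

from datetime import date, timedelta

def build_rotation(engineers: list[str], start: date, weeks: int):
    """Round-robin one-week shifts; the next person in line serves as secondary."""
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule

# Publish the next couple of months in advance so no one is surprised.
for week_start, primary, secondary in build_rotation(
    ["sarah", "mike", "ana", "devon", "lee"], start=date(2025, 1, 6), weeks=8
):
    print(f"Week of {week_start}: primary={primary}, secondary={secondary}")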
Handovers and Follow-the-Sun
Structured handovers matter:
At the end of each rotation, do a 15-minute sync:
- What incidents happened this week?
- Any ongoing issues or monitoring concerns?
- Any alerts that feel noisy or wrong?
- Brief the next person on system state
For distributed teams, follow-the-sun rotation can be magical:
- The APAC engineer hands off to the EMEA engineer, who hands off to the Americas engineer
- Reduces sleep disruption dramatically
- Requires good documentation and async communication
- Not always feasible, but worth considering
Time Off and Backup Coverage
Plan for this explicitly:
- Every engineer gets at least 2 weeks off per year where they're completely unreachable
- Rotation schedules account for holidays and PTO in advance
- No one should ever cancel vacation because "we don't have coverage"
- If you can't cover planned vacation, you're understaffed
The backup rotation rule:
If someone is on-call and has to page secondary more than twice in a week, that's a signal something is wrong—with the system, the documentation, or the training.
Designing Alerts: Signal Over Noise
Alert fatigue kills on-call sustainability faster than anything else.
The Golden Rule
Only alert a human when immediate action is needed right now.
Everything else goes to:
- Dashboards (check in the morning)
- Email digests (daily summary)
- Slack channels (non-urgent visibility)
- Weekly reports (trend analysis)
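One way to enforce this rule is to route on a declared urgency instead of letting every alert default to the pager. A minimal sketch in Python, with illustrative channel names:

# Only "page" interrupts a human right now; everything else waits.
ROUTES = {
    "page": "pager",                # Immediate human action required
    "morning": "email-digest",      # Daily summary, handled in business hours
    "visibility": "slack:#alerts",  # Non-urgent awareness
    "trend": "weekly-report",       # Long-term analysis
}

def route(alert_name: str, urgency: str) -> str:
    destination = ROUTES.get(urgency, "slack:#alerts")  # Unknown urgency defaults away from the pager
    print(f"{alert_name} -> {destination}")
    return destination

route("disk_usage_62_percent", "morning")       # Dashboard material, not a page
route("payment_error_rate_95_percent", "page")  # Wake someone up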
Examples: Good vs Bad Alerts
Bad Alert: Low Signal
🔴 CRITICAL: Disk usage at 62% on db-replica-03
Why it's bad:
- Not actionable (62% isn't critical)
- Creates noise and desensitizes engineers
- Should be a warning, not a page
Good Alert: High Signal
🔴 CRITICAL: Payment processing failing - 95% error rate
Action: Check payment gateway health, review recent deploys
Runbook: https://wiki/payments-down
Why it's good:
- Clearly urgent (customers can't pay)
- Provides context (error rate)
- Points to next steps immediately
Alert Design Principles
1. Actionability
Ask: "What should the on-call engineer do right now?"
If the answer is "nothing yet, just monitor," it's not a page.
2. Thresholds
- CPU at 60%? Probably fine.
- CPU at 95% for 5+ minutes? Page.
- Error rate jumped from 0.1% to 10%? Page.
- Error rate at steady 0.5%? Dashboard metric, not page.
3. Time-of-day awareness
Some issues can wait until morning:
- Non-critical batch job failed at 2 AM → morning alert
- Payment API down at 2 AM → immediate page
- Low-traffic internal tool slow at 2 AM → morning alert
4. Self-healing first
If a system can auto-restart or auto-scale to handle the issue, let it. Alert after it self-heals if you want visibility, but don't page preemptively.
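Putting the actionability, threshold, and time-of-day principles together, here is a sketch of a paging decision in Python; the numbers mirror the examples above and should be tuned per service:

from datetime import datetime
from typing import Optional

def should_page(metric: str, value: float, minutes_sustained: float,
                business_critical: bool, now: Optional[datetime] = None) -> bool:
    """Page only for severe, sustained problems; defer everything else to morning channels."""
    now = now or datetime.now()
    sleep_hours = now.hour >= 22 or now.hour < 7  # 10 PM - 7 AM

    if metric == "cpu_percent":
        severe = value >= 95 and minutes_sustained >= 5   # 95% for 5+ minutes, not a 60% blip
    elif metric == "error_rate_percent":
        severe = value >= 10                              # the 0.1% -> 10% jump, not a steady 0.5%
    else:
        severe = False

    if not severe:
        return False   # Dashboard or digest, not a page
    if sleep_hours and not business_critical:
        return False   # Non-critical systems wait until morning
    return True        # A payment API down at 2 AM lands here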
Treating Alert Fatigue as a Bug
Every false positive or noisy alert should be treated like a P1 bug:
- Document it in the incident log
- Create a ticket to fix the alert threshold
- Review in the next retro
Track this metric: alerts per on-call shift.
- Target: <5 actionable alerts per week
- Warning zone: 10–15 alerts per week
- Crisis zone: >20 alerts per week (system or alert design is broken)
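A rough way to compute that metric from a paging-tool export; the record shape here is an assumption:

from collections import Counter

# Assumed export shape: one record per page, tagged with the on-call shift it landed in.
pages = [
    {"shift": "2025-W02", "alert": "payment_error_rate"},
    {"shift": "2025-W02", "alert": "disk_usage_replica"},
    {"shift": "2025-W03", "alert": "payment_error_rate"},
]

for shift, count in sorted(Counter(p["shift"] for p in pages).items()):
    if count > 20:
        status = "crisis zone"
    elif count >= 10:
        status = "warning zone"
    elif count < 5:
        status = "on target"
    else:
        status = "above target"
    print(f"{shift}: {count} alerts ({status})")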
Runbooks, Tooling, and Self-Healing
On-call engineers shouldn't have to be heroes who figure everything out from scratch at 3 AM.
The Minimal Runbook
For every alert, provide a runbook with:
1. What this alert means
"Payment gateway is returning 500 errors at high rate. This blocks customer transactions."
2. Immediate first steps
- Check payment gateway status page
- Review last 3 deploys in #payments-deploys
- Check error logs:
kubectl logs -n payments service/gateway --tail=100
3. How to verify it's fixed
- Payment success rate returns to >99%
- Error rate drops below 1%
- Customer support tickets stop spiking
4. Escalation path
- If you can't resolve in 15 minutes, page secondary: @payments-oncall-secondary
- If systemic issue, escalate to EM: @engineering-manager
- If payment provider outage, notify leadership: @leadership-oncall
5. Common fixes
- Restart service:
kubectl rollout restart deployment/payment-gateway
- Roll back last deploy:
./scripts/rollback.sh payments v1.2.3
- Failover to backup provider:
./scripts/failover-payment.sh
Tooling to Reduce Cognitive Load
Incident dashboard:
One place to see all active alerts, recent deployments, system health. Examples: PagerDuty, Opsgenie, custom dashboards.
Chat integrations:
Alerts should go to dedicated Slack/Teams channels with context. Engineers can collaborate there instead of pinging each other randomly.
One-click common actions:
Scripts for common fixes (restart, rollback, failover) that are tested and safe. Reduces decision fatigue at 3 AM.
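As a sketch of what "tested and safe" can look like, here is a tiny Python wrapper around pre-approved commands that asks for a single confirmation and leaves an audit trail; the command names and the rollback script are the hypothetical ones from the runbook above:

import subprocess
from datetime import datetime, timezone

# Pre-approved commands only; nothing free-form at 3 AM.
ACTIONS = {
    "restart-gateway": ["kubectl", "rollout", "restart", "deployment/payment-gateway"],
    "rollback-payments": ["./scripts/rollback.sh", "payments"],  # hypothetical runbook script
}

def run_action(name: str) -> int:
    command = ACTIONS[name]
    print(f"About to run: {' '.join(command)}")
    if input("Proceed? [y/N] ").strip().lower() != "y":
        print("Aborted.")
        return 1
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    with open("oncall-actions.log", "a") as log:
        # The audit line makes the postmortem timeline easy to reconstruct.
        log.write(f"{started} {name} exit={result.returncode}\n")
    return result.returncode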
Building Self-Healing Systems
The best on-call is no on-call. Invest in systems that fix themselves:
Auto-scaling:
Traffic spike? System scales up automatically. No page needed.
Circuit breakers:
Dependency failing? Circuit breaker opens, system degrades gracefully. Alert in morning, don't page at night.
Auto-restart on crashes:
Service crashes? Orchestrator (Kubernetes, ECS) restarts it. Page only if it crashes repeatedly.
Health checks and failover:
Primary database unresponsive? Automatic failover to replica. Page about the incident, but system stayed up.
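As one example, here is a bare-bones circuit breaker in Python; the thresholds are illustrative, and production code would usually reach for a maintained library instead:

import time

class CircuitBreaker:
    """Open after repeated failures so the system degrades gracefully instead of paging someone."""

    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: serve a fallback, alert in the morning")
            # Cool-down elapsed: go half-open, where one more failure re-trips the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # Trip: stop hammering the failing dependency.
            raise
        self.failures = 0  # A healthy call resets the count.
        return result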
Every hour you invest in self-healing saves dozens of hours of on-call stress.
Post-Incident Rituals (Blameless and Useful)
How you handle incidents determines whether engineers trust the on-call system.
The 30-Minute Blameless Postmortem
Within 48 hours of any significant incident, do a short, focused postmortem. Not a 3-hour marathon. 30 minutes, structured.
Agenda:
1. Timeline (5 minutes)
What happened, when?
- 2:47 AM: Alert fired for payment gateway errors
- 2:52 AM: On-call engineer investigated logs
- 3:15 AM: Root cause identified (rate limit hit)
- 3:30 AM: Rate limit increased, system recovered
2. Root cause (5 minutes)
Why did this happen?
- Traffic spiked due to flash sale
- Rate limiting threshold was set too low
- No auto-scaling for sudden traffic
3. What went well (5 minutes)
- Alert fired promptly
- Runbook was clear
- Recovery was fast (43 minutes)
4. What needs improvement (10 minutes)
- Need better traffic forecasting before sales
- Rate limits should be reviewed quarterly
- Should implement auto-scaling for payment service
5. Action items with owners (5 minutes)
- @sarah: Implement auto-scaling for payment gateway (1 week)
- @mike: Review and document rate limit thresholds (3 days)
- @product: Give engineering 48hr notice before flash sales (ongoing)
The Blameless Principle
Never:
"This happened because Sarah didn't check the logs carefully."
Instead:
"This happened because our logs don't surface rate limit errors prominently. Let's make them more visible."
Never:
"Mike should have known the threshold was too low."
Instead:
"We didn't have a process for reviewing thresholds. Let's schedule quarterly reviews."
The goal is system improvement, not individual blame. If people are afraid of being blamed, they'll hide problems instead of surfacing them.
On-Call Health Metrics
You can't improve what you don't measure. Track these metrics quarterly:
System Health Metrics
1. Alerts per on-call shift
- Target: <5 per week
- If higher, investigate noisy alerts or system instability
2. Sleep-hour alerts (10 PM – 7 AM)
- Target: <20% of total alerts
- If higher, look for issues that could auto-heal or wait until morning
3. Mean Time to Acknowledge (MTTA)
- Target: <5 minutes
- If higher, check if alerts are clear and runbooks are helpful
4. Mean Time to Recovery (MTTR)
- Track per incident type
- Trending upward? Systems are getting more complex or less documented
5. Incident trend
- Are incidents decreasing month-over-month?
- If flat or increasing, you're not learning from postmortems
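A sketch of how MTTA, MTTR, and the sleep-hour percentage can fall out of an incident-log export; the field names and timestamps are assumptions:

from datetime import datetime
from statistics import mean

# Assumed export: ISO timestamps for when each incident fired, was acknowledged, and was resolved.
incidents = [
    {"fired": "2025-01-10T02:47:00", "acked": "2025-01-10T02:52:00", "resolved": "2025-01-10T03:30:00"},
    {"fired": "2025-01-14T14:05:00", "acked": "2025-01-14T14:07:00", "resolved": "2025-01-14T14:40:00"},
]

def minutes_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

mtta = mean(minutes_between(i["fired"], i["acked"]) for i in incidents)
mttr = mean(minutes_between(i["fired"], i["resolved"]) for i in incidents)
sleep_hour = sum(1 for i in incidents if not 7 <= datetime.fromisoformat(i["fired"]).hour < 22)

print(f"MTTA: {mtta:.1f} min (target <5)")
print(f"MTTR: {mttr:.1f} min (watch the trend per incident type)")
print(f"Sleep-hour alerts: {sleep_hour / len(incidents):.0%} of total (target <20%)")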
Human Health Metrics
1. On-call load distribution
- Is one person carrying 50% of rotations? Red flag.
- Aim for even distribution across team.
2. Burnout signals
- Engineers refusing on-call duty
- Increased sick days during on-call weeks
- Turnover among on-call engineers
- Complaints in retros or 1:1s
3. Time to escalation
- How often does primary escalate to secondary?
- High escalation rate suggests training or documentation gaps
4. Post-incident recovery time
- Are engineers getting time off after rough on-call weeks?
- Track this explicitly.
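Load distribution is easy to check from the same rotation history; a sketch with placeholder data:

from collections import Counter

# Primary shifts over the last quarter, pulled from the rotation history (placeholder data).
shift_history = ["sarah", "sarah", "mike", "sarah", "ana", "sarah",
                 "mike", "sarah", "ana", "sarah", "sarah", "mike"]

total = len(shift_history)
for engineer, count in Counter(shift_history).most_common():
    share = count / total
    flag = "  <-- red flag: over half the rotations" if share > 0.5 else ""
    print(f"{engineer}: {count}/{total} shifts ({share:.0%}){flag}")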
The Quarterly On-Call Health Review
Every quarter, spend 30 minutes reviewing these metrics with the team:
- What's getting better?
- What's getting worse?
- What should we change about the rotation, alerts, or processes?
This shows engineers you care about sustainability, not just uptime.
Culture: Respecting the Pager
Metrics and processes only work if the culture supports them.
Don't Glorify Hero Firefighting
Anti-pattern:
"Shoutout to Mike for staying up until 4 AM to fix the outage! He's a rockstar! 🎸"
Why it's harmful:
- Glorifies unsustainable behavior
- Pressures others to also be "heroes"
- Doesn't address why Mike had to stay up until 4 AM
Better approach:
"Thanks to Mike for handling the outage. We're scheduling a postmortem to understand why it happened and prevent it from happening again. Mike, take tomorrow morning off."
Reward System Improvement, Not Just Firefighting
Publicly recognize engineers who:
- Reduce alert noise
- Write clear runbooks
- Build self-healing mechanisms
- Improve system reliability
This shifts incentives from "reactive hero" to "proactive system designer."
Compensate On-Call Fairly
Options teams use:
On-call stipend: Fixed payment per week on-call (e.g., $500–$1000/week depending on company stage)
Time-in-lieu: If you work 6 hours during off-hours, take 6 hours off later that week
Mandatory recovery time: After a rough on-call week (e.g., more than 10 hours of off-hours work), the engineer gets a recovery day
Rotation bonus: Annual bonus component tied to on-call participation
The principle: On-call is extra responsibility and disruption. Compensate accordingly.
Set Clear Norms from Leadership
If you're an EM or CTO, say things like:
"On-call is a shared responsibility. No one should carry it alone for months."
"If you get paged at 2 AM, don't come to standup. Sleep in and join us after lunch."
"Every incident is a chance to make the system better. We don't blame people, we improve systems."
"If on-call is burning you out, tell me. That's not on you—that's on the system, and we'll fix it."
Leadership behavior sets the culture. Model the behavior you want to see.
Reliability Without Sacrificing People
Let me close with the central belief that should guide every on-call decision:
Production reliability and human sustainability are not trade-offs. They're the same problem.
Systems that burn out engineers will eventually fail because:
- Exhausted engineers make mistakes
- Good engineers leave for healthier teams
- Documentation and improvements never happen
- Technical debt compounds
- Incident response gets slower, not faster
The most reliable systems I've seen are run by teams where:
- On-call is sustainable and fairly distributed
- Alerts have high signal and low noise
- Engineers trust the system won't destroy their lives
- Every incident makes the system better
You can have both reliability and sustainability. In fact, you can't have one without the other for long.
Your On-Call Health Checklist
Use this to evaluate and improve your current on-call setup:
Rotation Design
- Team has at least 4–5 engineers for sustainable rotation
- Primary + secondary model for safety net
- One-week rotations (not longer)
- Planned handover process with notes
- On-call schedule published 4+ weeks in advance
- Clear process for backup coverage during PTO
Alert Quality
- Alerts only fire when immediate action is needed
- <5 actionable alerts per on-call shift
- <20% of alerts during sleep hours (10 PM – 7 AM)
- Every alert has a runbook
- False positives are treated as bugs and fixed
Documentation & Tooling
- Clear runbooks for every alert with next steps
- One-click access to logs, metrics, and dashboards
- Escalation paths documented
- Common fixes scripted and tested
- Incident response chat channel exists
Post-Incident Processes
- Blameless postmortems within 48 hours
- Action items have owners and deadlines
- Incidents trend downward over time
- Learnings are shared with broader team
Cultural Support
- On-call is compensated (stipend, time-off, or bonus)
- Recovery time after rough on-call weeks
- Leadership doesn't glorify hero firefighting
- Engineers volunteer for on-call, don't avoid it
- Quarterly health review of on-call system
Metrics Tracked
- Alerts per shift
- Sleep-hour alerts percentage
- MTTA and MTTR
- On-call load distribution
- Burnout signals (turnover, complaints, refusals)
If your current on-call creates zombie engineers, you don't have weak engineers. You have a broken system.
Start with one thing from this checklist. Fix it. Measure the impact. Move to the next.
Build a system that keeps services running and keeps engineers healthy.
Both matter equally.
