Designing On-Call Schedules That Don't Burn Out Your Engineers
If your on-call design creates zombie engineers who are always tired and on edge, your system is unhealthy—even if uptime looks good. Learn how to design on-call rotations, alerts, and culture that keep systems reliable and engineers sustainable.

TL;DR
Healthy on-call has predictable rotations, high-signal alerts, clear runbooks, and respected recovery time. Distribute load fairly, treat false positives as bugs, and make incidents trend downward over time. Reliability and sustainability aren't opposing goals—they're the same problem.
The Zombie Engineer Story
Meet Sarah. She's a senior backend engineer on your team. Technically brilliant. Owns the payments system. Everyone respects her work.
She's also exhausted.
Last week, she got paged four times between midnight and 5 AM. Not for critical outages—for noisy alerts that could have waited until morning. Saturday afternoon, another page during her daughter's birthday party. Sunday evening, two more false positives that reset her sleep schedule before Monday morning standup.
She's been on-call for six months straight because the team is understaffed and "she knows the system best." She doesn't complain because she's professional, but you've noticed she's slower to respond to messages, taking more sick days, and her last 1:1 felt distant.
Sarah isn't struggling because she's weak. She's struggling because your on-call system is broken.
Here's the uncomfortable truth: if your on-call design creates zombie engineers who are always tired, always on edge, and slowly burning out, your system is unhealthy—even if your uptime dashboard looks great.
Reliability and sustainability aren't opposing goals. They're the same problem. Let's fix both.
What a Healthy On-Call System Looks Like
Before we dive into mechanics, let's paint a picture of what good looks like.
In a healthy on-call system:
Rotations are predictable and fair
Engineers know when they're on-call weeks in advance. The load is distributed evenly. No one carries the pager for months because "they're the only one who knows the system."
Alerts have high signal, low noise
When the pager goes off, it's always for something that requires immediate human action. False positives are rare and treated as bugs to be fixed.
Responsibilities are documented
Every on-call engineer has clear runbooks. They know what to do, who to escalate to, and how to verify the fix worked.
Recovery time is respected
If you get paged at 3 AM, you're not expected in standup at 9 AM. After a rough on-call week, you get recovery time before the next rotation.
Incidents trend downward over time
The team treats every incident as a learning opportunity. Systems get more resilient. Alerts get smarter. On-call gets easier, not harder.
Engineers volunteer for on-call
This sounds impossible, but I've seen it. When on-call is well-designed, senior engineers willingly participate because they trust the system won't destroy their lives.
This isn't utopian thinking. This is achievable with deliberate design.
Designing the Rotation
Let's start with the foundation: how you structure the rotation itself.
Key Variables to Consider
Team size:
- Minimum 4–5 engineers for sustainable rotation
- Smaller teams need secondary coverage or shared rotation across teams
- If you can't staff this, you have a hiring problem, not an on-call problem
Time zones:
- Single timezone: simpler rotation
- Distributed team: follow-the-sun can reduce sleep disruption
- Don't expect one person to cover 24/7 across all timezones
Product criticality:
- E-commerce during Black Friday: needs robust coverage
- Internal tools: can tolerate longer response times
- Adjust rotation intensity to actual business impact
Rotation Patterns That Work
Primary + Secondary Model
The most common and effective pattern:
- Primary on-call: First responder, handles all alerts
- Secondary on-call: Backup if primary doesn't respond in 10–15 minutes, or escalation for complex issues
This provides:
- Safety net when primary is unavailable or overwhelmed
- Learning opportunity for less experienced engineers (shadow as secondary first)
- Clear escalation path
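To make the primary-to-secondary handoff concrete, here is a minimal sketch in Python, assuming a generic page() hook rather than any particular paging tool's API:

import time

def page(person: str, alert: str) -> None:
    # Placeholder notifier; wire this to your paging tool of choice.
    print(f"Paging {person}: {alert}")

def acknowledged(person: str) -> bool:
    # Placeholder; in practice, poll your paging tool for an ACK.
    return False

def escalate(alert: str, primary: str, secondary: str, ack_timeout_min: int = 15) -> None:
    """Page the primary; if there is no ACK within the window, page the secondary."""
    page(primary, alert)
    deadline = time.time() + ack_timeout_min * 60
    while time.time() < deadline:
        if acknowledged(primary):
            return  # Primary has it; no escalation needed.
        time.sleep(30)  # Re-check every 30 seconds.
    page(secondary, alert)  # No ACK within the window; bring in the backup.

Tools like PagerDuty and Opsgenie implement this escalation natively; the point is that the timeout and the fallback are explicit configuration, not tribal knowledge.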
Rotation Length: 1 Week Sweet Spot
After years of experimentation, I've found that one-week rotations work best for most teams:
Why 1 week:
- Long enough to minimize handover overhead
- Short enough that a rough week doesn't destroy morale
- Predictable end date keeps stress manageable
Why not longer:
- Two-week rotations amplify cumulative fatigue
- Engineers start dreading the rotation weeks in advance
- Harder to maintain coverage during holidays and PTO
Why not shorter:
- Daily rotations create too much handover overhead
- Weekend-only rotations feel punitive
- Context switching is expensive
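To show how little machinery a predictable, fair schedule needs, here is a minimal sketch of a one-week primary/secondary round-robin; the names and start date are placeholders, and a real schedule still has to account for PTO and holidays:

from datetime import date, timedelta

def build_rotation(engineers: list[str], start: date, weeks: int):
    """Round-robin one-week shifts; the next person in line serves as secondary."""
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, secondary))
    return schedule

# Publish the next couple of months in advance so no one is surprised.
for week_start, primary, secondary in build_rotation(
    ["sarah", "mike", "ana", "devon", "lee"], start=date(2025, 1, 6), weeks=8
):
    print(f"Week of {week_start}: primary={primary}, secondary={secondary}")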
Handovers and Follow-the-Sun
Structured handovers matter:
At the end of each rotation, do a 15-minute sync:
- What incidents happened this week?
- Any ongoing issues or monitoring concerns?
- Any alerts that feel noisy or wrong?
- Brief the next person on system state
For distributed teams, follow-the-sun rotation can be magical:
- The APAC engineer hands off to the EMEA engineer, who hands off to the Americas engineer
- Reduces sleep disruption dramatically
- Requires good documentation and async communication
- Not always feasible, but worth considering
Time Off and Backup Coverage
Plan for this explicitly:
- Every engineer gets at least 2 weeks off per year where they're completely unreachable
- Rotation schedules account for holidays and PTO in advance
- No one should ever cancel vacation because "we don't have coverage"
- If you can't cover planned vacation, you're understaffed
The backup rotation rule:
If someone is on-call and has to page secondary more than twice in a week, that's a signal something is wrong—with the system, the documentation, or the training.
Designing Alerts: Signal Over Noise
Alert fatigue kills on-call sustainability faster than anything else.
The Golden Rule
Only alert a human when immediate action is needed right now.
Everything else goes to:
- Dashboards (check in the morning)
- Email digests (daily summary)
- Slack channels (non-urgent visibility)
- Weekly reports (trend analysis)
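One way to enforce this rule is to route on a declared urgency instead of letting every alert default to the pager. A minimal sketch in Python, with illustrative channel names:

# Only "page" interrupts a human right now; everything else waits.
ROUTES = {
    "page": "pager",                # Immediate human action required
    "morning": "email-digest",      # Daily summary, handled in business hours
    "visibility": "slack:#alerts",  # Non-urgent awareness
    "trend": "weekly-report",       # Long-term analysis
}

def route(alert_name: str, urgency: str) -> str:
    destination = ROUTES.get(urgency, "slack:#alerts")  # Unknown urgency defaults away from the pager
    print(f"{alert_name} -> {destination}")
    return destination

route("disk_usage_62_percent", "morning")       # Dashboard material, not a page
route("payment_error_rate_95_percent", "page")  # Wake someone up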
Examples: Good vs Bad Alerts
Bad Alert: Low Signal
🔴 CRITICAL: Disk usage at 62% on db-replica-03
Why it's bad:
- Not actionable (62% isn't critical)
- Creates noise and desensitizes engineers
- Should be a warning, not a page
Good Alert: High Signal
🔴 CRITICAL: Payment processing failing - 95% error rate
Action: Check payment gateway health, review recent deploys
Runbook: https://wiki/payments-down
Why it's good:
- Clearly urgent (customers can't pay)
- Provides context (error rate)
- Points to next steps immediately
Alert Design Principles
1. Actionability
Ask: "What should the on-call engineer do right now?"
If the answer is "nothing yet, just monitor," it's not a page.
2. Thresholds
- CPU at 60%? Probably fine.
- CPU at 95% for 5+ minutes? Page.
- Error rate jumped from 0.1% to 10%? Page.
- Error rate at steady 0.5%? Dashboard metric, not page.
3. Time-of-day awareness
Some issues can wait until morning:
- Non-critical batch job failed at 2 AM → morning alert
- Payment API down at 2 AM → immediate page
- Low-traffic internal tool slow at 2 AM → morning alert
4. Self-healing first
If a system can auto-restart or auto-scale to handle the issue, let it. Alert after it self-heals if you want visibility, but don't page preemptively.
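Putting the actionability, threshold, and time-of-day principles together, here is a sketch of a paging decision in Python; the numbers mirror the examples above and should be tuned per service:

from datetime import datetime
from typing import Optional

def should_page(metric: str, value: float, minutes_sustained: float,
                business_critical: bool, now: Optional[datetime] = None) -> bool:
    """Page only for severe, sustained problems; defer everything else to morning channels."""
    now = now or datetime.now()
    sleep_hours = now.hour >= 22 or now.hour < 7  # 10 PM - 7 AM

    if metric == "cpu_percent":
        severe = value >= 95 and minutes_sustained >= 5   # 95% for 5+ minutes, not a 60% blip
    elif metric == "error_rate_percent":
        severe = value >= 10                              # the 0.1% -> 10% jump, not a steady 0.5%
    else:
        severe = False

    if not severe:
        return False   # Dashboard or digest, not a page
    if sleep_hours and not business_critical:
        return False   # Non-critical systems wait until morning
    return True        # A payment API down at 2 AM lands here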
Treating Alert Fatigue as a Bug
Every false positive or noisy alert should be treated like a P1 bug:
- Document it in the incident log
- Create a ticket to fix the alert threshold
- Review in the next retro
Track this metric: alerts per on-call shift.
- Target: <5 actionable alerts per week
- Warning zone: 10–15 alerts per week
- Crisis zone: >20 alerts per week (system or alert design is broken)
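A rough way to compute that metric from a paging-tool export; the record shape here is an assumption:

from collections import Counter

# Assumed export shape: one record per page, tagged with the on-call shift it landed in.
pages = [
    {"shift": "2025-W02", "alert": "payment_error_rate"},
    {"shift": "2025-W02", "alert": "disk_usage_replica"},
    {"shift": "2025-W03", "alert": "payment_error_rate"},
]

for shift, count in sorted(Counter(p["shift"] for p in pages).items()):
    if count > 20:
        status = "crisis zone"
    elif count >= 10:
        status = "warning zone"
    elif count < 5:
        status = "on target"
    else:
        status = "above target"
    print(f"{shift}: {count} alerts ({status})")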
Runbooks, Tooling, and Self-Healing
On-call engineers shouldn't have to be heroes who figure everything out from scratch at 3 AM.
The Minimal Runbook
For every alert, provide a runbook with:
1. What this alert means
"Payment gateway is returning 500 errors at high rate. This blocks customer transactions."
2. Immediate first steps
- Check payment gateway status page
- Review last 3 deploys in #payments-deploys
- Check error logs:
kubectl logs -n payments service/gateway --tail=100
3. How to verify it's fixed
- Payment success rate returns to >99%
- Error rate drops below 1%
- Customer support tickets stop spiking
4. Escalation path
- If you can't resolve in 15 minutes, page secondary: @payments-oncall-secondary
- If systemic issue, escalate to EM: @engineering-manager
- If payment provider outage, notify leadership: @leadership-oncall
5. Common fixes
- Restart service:
kubectl rollout restart deployment/payment-gateway
- Roll back last deploy:
./scripts/rollback.sh payments v1.2.3
- Failover to backup provider:
./scripts/failover-payment.sh
Tooling to Reduce Cognitive Load
Incident dashboard:
One place to see all active alerts, recent deployments, system health. Examples: PagerDuty, Opsgenie, custom dashboards.
Chat integrations:
Alerts should go to dedicated Slack/Teams channels with context. Engineers can collaborate there instead of pinging each other randomly.
One-click common actions:
Scripts for common fixes (restart, rollback, failover) that are tested and safe. Reduces decision fatigue at 3 AM.
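As a sketch of what "tested and safe" can look like, here is a tiny Python wrapper around pre-approved commands that asks for a single confirmation and leaves an audit trail; the command names and the rollback script are the hypothetical ones from the runbook above:

import subprocess
from datetime import datetime, timezone

# Pre-approved commands only; nothing free-form at 3 AM.
ACTIONS = {
    "restart-gateway": ["kubectl", "rollout", "restart", "deployment/payment-gateway"],
    "rollback-payments": ["./scripts/rollback.sh", "payments"],  # hypothetical runbook script
}

def run_action(name: str) -> int:
    command = ACTIONS[name]
    print(f"About to run: {' '.join(command)}")
    if input("Proceed? [y/N] ").strip().lower() != "y":
        print("Aborted.")
        return 1
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    with open("oncall-actions.log", "a") as log:
        # The audit line makes the postmortem timeline easy to reconstruct.
        log.write(f"{started} {name} exit={result.returncode}\n")
    return result.returncode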
Building Self-Healing Systems
The best on-call is no on-call. Invest in systems that fix themselves:
Auto-scaling:
Traffic spike? System scales up automatically. No page needed.
Circuit breakers:
Dependency failing? Circuit breaker opens, system degrades gracefully. Alert in morning, don't page at night.
Auto-restart on crashes:
Service crashes? Orchestrator (Kubernetes, ECS) restarts it. Page only if it crashes repeatedly.
Health checks and failover:
Primary database unresponsive? Automatic failover to replica. Page about the incident, but system stayed up.
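As one example, here is a bare-bones circuit breaker in Python; the thresholds are illustrative, and production code would usually reach for a maintained library instead:

import time

class CircuitBreaker:
    """Open after repeated failures so the system degrades gracefully instead of paging someone."""

    def __init__(self, failure_threshold: int = 5, reset_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_seconds:
                raise RuntimeError("circuit open: serve a fallback, alert in the morning")
            # Cool-down elapsed: go half-open, where one more failure re-trips the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()  # Trip: stop hammering the failing dependency.
            raise
        self.failures = 0  # A healthy call resets the count.
        return result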
Every hour you invest in self-healing saves dozens of hours of on-call stress.
Post-Incident Rituals (Blameless and Useful)
How you handle incidents determines whether engineers trust the on-call system.
The 30-Minute Blameless Postmortem
Within 48 hours of any significant incident, do a short, focused postmortem. Not a 3-hour marathon. 30 minutes, structured.
Agenda:
1. Timeline (5 minutes)
What happened, when?
- 2:47 AM: Alert fired for payment gateway errors
- 2:52 AM: On-call engineer investigated logs
- 3:15 AM: Root cause identified (rate limit hit)
- 3:30 AM: Rate limit increased, system recovered
2. Root cause (5 minutes)
Why did this happen?
- Traffic spiked due to flash sale
- Rate limiting threshold was set too low
- No auto-scaling for sudden traffic
3. What went well (5 minutes)
- Alert fired promptly
- Runbook was clear
- Recovery was fast (43 minutes)
4. What needs improvement (10 minutes)
- Need better traffic forecasting before sales
- Rate limits should be reviewed quarterly
- Should implement auto-scaling for payment service
5. Action items with owners (5 minutes)
- @sarah: Implement auto-scaling for payment gateway (1 week)
- @mike: Review and document rate limit thresholds (3 days)
- @product: Give engineering 48hr notice before flash sales (ongoing)
The Blameless Principle
Never:
"This happened because Sarah didn't check the logs carefully."
Instead:
"This happened because our logs don't surface rate limit errors prominently. Let's make them more visible."
Never:
"Mike should have known the threshold was too low."
Instead:
"We didn't have a process for reviewing thresholds. Let's schedule quarterly reviews."
The goal is system improvement, not individual blame. If people are afraid of being blamed, they'll hide problems instead of surfacing them.
On-Call Health Metrics
You can't improve what you don't measure. Track these metrics quarterly:
System Health Metrics
1. Alerts per on-call shift
- Target: <5 per week
- If higher, investigate noisy alerts or system instability
2. Sleep-hour alerts (10 PM – 7 AM)
- Target: <20% of total alerts
- If higher, look for issues that could auto-heal or wait until morning
3. Mean Time to Acknowledge (MTTA)
- Target: <5 minutes
- If higher, check if alerts are clear and runbooks are helpful
4. Mean Time to Recovery (MTTR)
- Track per incident type
- Trending upward? Systems are getting more complex or less documented
5. Incident trend
- Are incidents decreasing month-over-month?
- If flat or increasing, you're not learning from postmortems
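A sketch of how MTTA, MTTR, and the sleep-hour percentage can fall out of an incident-log export; the field names and timestamps are assumptions:

from datetime import datetime
from statistics import mean

# Assumed export: ISO timestamps for when each incident fired, was acknowledged, and was resolved.
incidents = [
    {"fired": "2025-01-10T02:47:00", "acked": "2025-01-10T02:52:00", "resolved": "2025-01-10T03:30:00"},
    {"fired": "2025-01-14T14:05:00", "acked": "2025-01-14T14:07:00", "resolved": "2025-01-14T14:40:00"},
]

def minutes_between(start: str, end: str) -> float:
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

mtta = mean(minutes_between(i["fired"], i["acked"]) for i in incidents)
mttr = mean(minutes_between(i["fired"], i["resolved"]) for i in incidents)
sleep_hour = sum(1 for i in incidents if not 7 <= datetime.fromisoformat(i["fired"]).hour < 22)

print(f"MTTA: {mtta:.1f} min (target <5)")
print(f"MTTR: {mttr:.1f} min (watch the trend per incident type)")
print(f"Sleep-hour alerts: {sleep_hour / len(incidents):.0%} of total (target <20%)")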
Human Health Metrics
1. On-call load distribution
- Is one person carrying 50% of rotations? Red flag.
- Aim for even distribution across team.
2. Burnout signals
- Engineers refusing on-call duty
- Increased sick days during on-call weeks
- Turnover among on-call engineers
- Complaints in retros or 1:1s
3. Time to escalation
- How often does primary escalate to secondary?
- High escalation rate suggests training or documentation gaps
4. Post-incident recovery time
- Are engineers getting time off after rough on-call weeks?
- Track this explicitly.
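Load distribution is easy to check from the same rotation history; a sketch with placeholder data:

from collections import Counter

# Primary shifts over the last quarter, pulled from the rotation history (placeholder data).
shift_history = ["sarah", "sarah", "mike", "sarah", "ana", "sarah",
                 "mike", "sarah", "ana", "sarah", "sarah", "mike"]

total = len(shift_history)
for engineer, count in Counter(shift_history).most_common():
    share = count / total
    flag = "  <-- red flag: over half the rotations" if share > 0.5 else ""
    print(f"{engineer}: {count}/{total} shifts ({share:.0%}){flag}")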
The Quarterly On-Call Health Review
Every quarter, spend 30 minutes reviewing these metrics with the team:
- What's getting better?
- What's getting worse?
- What should we change about the rotation, alerts, or processes?
This shows engineers you care about sustainability, not just uptime.
Culture: Respecting the Pager
Metrics and processes only work if the culture supports them.
Don't Glorify Hero Firefighting
Anti-pattern:
"Shoutout to Mike for staying up until 4 AM to fix the outage! He's a rockstar! 🎸"
Why it's harmful:
- Glorifies unsustainable behavior
- Pressures others to also be "heroes"
- Doesn't address why Mike had to stay up until 4 AM
Better approach:
"Thanks to Mike for handling the outage. We're scheduling a postmortem to understand why it happened and prevent it from happening again. Mike, take tomorrow morning off."
Reward System Improvement, Not Just Firefighting
Publicly recognize engineers who:
- Reduce alert noise
- Write clear runbooks
- Build self-healing mechanisms
- Improve system reliability
This shifts incentives from "reactive hero" to "proactive system designer."
Compensate On-Call Fairly
Options teams use:
On-call stipend: Fixed payment per week on-call (e.g., $500–$1000/week depending on company stage)
Time-in-lieu: If you work 6 hours during off-hours, take 6 hours off later that week
Mandatory recovery time: After a rough on-call week (e.g., more than 10 hours of off-hours work), the engineer gets a recovery day
Rotation bonus: Annual bonus component tied to on-call participation
The principle: On-call is extra responsibility and disruption. Compensate accordingly.
Set Clear Norms from Leadership
If you're an EM or CTO, say things like:
"On-call is a shared responsibility. No one should carry it alone for months."
"If you get paged at 2 AM, don't come to standup. Sleep in and join us after lunch."
"Every incident is a chance to make the system better. We don't blame people, we improve systems."
"If on-call is burning you out, tell me. That's not on you—that's on the system, and we'll fix it."
Leadership behavior sets the culture. Model the behavior you want to see.
Reliability Without Sacrificing People
Let me close with the central belief that should guide every on-call decision:
Production reliability and human sustainability are not trade-offs. They're the same problem.
Systems that burn out engineers will eventually fail because:
- Exhausted engineers make mistakes
- Good engineers leave for healthier teams
- Documentation and improvements never happen
- Technical debt compounds
- Incident response gets slower, not faster
The most reliable systems I've seen are run by teams where:
- On-call is sustainable and fairly distributed
- Alerts have high signal and low noise
- Engineers trust the system won't destroy their lives
- Every incident makes the system better
You can have both reliability and sustainability. In fact, you can't have one without the other for long.
Your On-Call Health Checklist
Use this to evaluate and improve your current on-call setup:
Rotation Design
- Team has at least 4–5 engineers for sustainable rotation
- Primary + secondary model for safety net
- One-week rotations (not longer)
- Planned handover process with notes
- On-call schedule published 4+ weeks in advance
- Clear process for backup coverage during PTO
Alert Quality
- Alerts only fire when immediate action is needed
- <5 actionable alerts per on-call shift
- <20% of alerts during sleep hours (10 PM – 7 AM)
- Every alert has a runbook
- False positives are treated as bugs and fixed
Documentation & Tooling
- Clear runbooks for every alert with next steps
- One-click access to logs, metrics, and dashboards
- Escalation paths documented
- Common fixes scripted and tested
- Incident response chat channel exists
Post-Incident Processes
- Blameless postmortems within 48 hours
- Action items have owners and deadlines
- Incidents trend downward over time
- Learnings are shared with broader team
Cultural Support
- On-call is compensated (stipend, time-off, or bonus)
- Recovery time after rough on-call weeks
- Leadership doesn't glorify hero firefighting
- Engineers volunteer for on-call, don't avoid it
- Quarterly health review of on-call system
Metrics Tracked
- Alerts per shift
- Sleep-hour alerts percentage
- MTTA and MTTR
- On-call load distribution
- Burnout signals (turnover, complaints, refusals)
If your current on-call creates zombie engineers, you don't have weak engineers. You have a broken system.
Start with one thing from this checklist. Fix it. Measure the impact. Move to the next.
Build a system that keeps services running and keeps engineers healthy.
Both matter equally.
