On-call is the tax engineers pay for running production systems. And like any tax, if it’s designed poorly, people resent it, avoid it, and eventually leave. I’ve been on both sides — the engineer getting paged six times in one night for alerts that didn’t matter, and the lead designing a rotation that people don’t dread. The difference is never the services. It’s always the rotation design, the alert quality, and the runbook culture.
If your on-call is burning people out, that’s not a people problem. It’s a design problem. And design problems have solutions.
> “Monitoring is for asking questions. Alerting is for demanding answers. Know the difference.”
> — Charity Majors
Why On-Call Burns People Out
It’s rarely one thing. It’s the compounding effect of several, and understanding them is the first step to fixing them.
| Burnout Factor | What It Looks Like | The Real Cost |
|---|---|---|
| Alert fatigue | 15 pages in a night, 12 are false positives | Engineers stop trusting the pager — then miss the real alert |
| Unclear runbooks | Paged at 3 AM with no idea what to do | Longer incidents, more stress, slower resolution |
| No escalation path | Stuck with no one to call for help | 4-hour solo incidents, isolation, resentment |
| No compensation | On-call is “just part of the job” | Your best engineers leave for companies without pagers |
| Uneven distribution | One person gets paged 3x more than others | Feels unfair because it is unfair — it’s a reliability problem masquerading as a rotation problem |
Every on-call system I’ve improved started by honestly diagnosing which of these factors was doing the most damage.
If your team has fewer than 5 engineers, think carefully about whether that team should own on-call at all. A 4-person rotation means on-call every 4 weeks with no backup. One person on leave breaks the whole system. Consider a shared rotation across teams instead.
Rotation Design
The structure of the rotation matters more than most teams realize. Here’s what works at different team sizes:
| Team Size | Rotation Style | On-Call Frequency | Sustainability |
|---|---|---|---|
| 3-4 | Weekly rotation | Every 3-4 weeks | Risky — one person leaving breaks it |
| 5-7 | Weekly rotation | Every 5-7 weeks | Sweet spot for most teams |
| 8-12 | Weekly rotation + backup | Every 8-12 weeks | Comfortable, room for shadow shifts |
| 12+ | Follow-the-sun | Varies by timezone | No after-hours pages — the gold standard |
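The schedule itself doesn't need to be hand-maintained. A minimal sketch of a weekly rotation with a built-in backup, where the engineer names and start date are placeholders:

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Yield (week_start, primary, backup) for each week of the schedule.

    The backup is the next engineer in the cycle, so everyone serves as
    backup the week before their primary shift and the handoff partner
    is always known in advance.
    """
    n = len(engineers)
    for w in range(weeks):
        yield (
            start + timedelta(weeks=w),
            engineers[w % n],        # primary
            engineers[(w + 1) % n],  # backup
        )
```

With five engineers, each person is primary every five weeks, matching the 5-7 row above, and the backup slot doubles as the shadow shift for the following week's primary.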
Handoff rituals matter. The transition between on-call shifts should be a short sync, not a Slack message. Cover these four things in 15 minutes:

- Open incidents and anything being actively watched
- Recent deploys or changes that might page in the coming week
- Alerts that were noisy or flaky during the shift
- Upcoming maintenance windows and known risks
It feels like overhead until the first time it prevents an incident.
Writing Good Runbooks: The 3 AM Test
The bar for a good runbook is simple: can a sleep-deprived engineer who didn’t build this service follow it successfully? If the answer is no, rewrite it.
Runbook quality checklist:

- Passes the 3 AM test: an engineer who didn't build the service can follow it while sleep-deprived
- Leads with the fastest safe mitigation, then the deeper diagnosis steps
- Contains real commands to copy-paste, not descriptions of commands
- Says when to stop and escalate if the steps don't resolve the issue
Include actual commands, actual links, and actual dashboard URLs. “Check the metrics” is useless. “Open [this specific dashboard URL] and look at the ‘Connection Pool’ panel in the top-right” saves minutes, and minutes matter at 3 AM.
Alert Quality Over Quantity
The single most impactful change you can make to an on-call system is reducing alert volume — not by suppressing alerts, but by making every alert meaningful.
Run a quarterly alert audit. For each alert that fired in the last 90 days, ask:
| Question | If “No”… |
|---|---|
| Did it require human action? | Automate the response or remove the alert |
| Was the right action obvious from the alert? | Improve the runbook |
| Did it genuinely need to page someone after hours? | Downgrade to business-hours-only |
| Was it a true positive? | If false positive rate > 10%, fix the threshold or delete it |
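The last question lends itself to automation. A minimal sketch, assuming you can export 90 days of firings from your paging tool as (alert name, was-true-positive) pairs — that export format is an assumption, but most tools can produce something like it:

```python
from collections import defaultdict

FALSE_POSITIVE_LIMIT = 0.10  # the >10% rule from the audit table

def audit(firings):
    """firings: iterable of (alert_name, was_true_positive) pairs
    covering the last 90 days. Returns [(alert_name, fp_rate), ...]
    for every alert over the limit, worst offenders first."""
    total = defaultdict(int)
    false_pos = defaultdict(int)
    for name, true_positive in firings:
        total[name] += 1
        if not true_positive:
            false_pos[name] += 1
    rates = {name: false_pos[name] / total[name] for name in total}
    flagged = [(n, r) for n, r in rates.items() if r > FALSE_POSITIVE_LIMIT]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Run it quarterly and walk the flagged list in the audit meeting: each entry either gets a better threshold or gets deleted.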
On a project I worked on, our first alert audit reduced pages from 45 per week to 12. The 33 we removed were false positives, informational alerts that should have been dashboard metrics, or alerts for conditions that self-healed within minutes.
Consider SLO-based alerting. Instead of alerting on individual symptoms (“CPU > 80%”, “memory > 90%”), alert on error budget burn rate. If your SLO is 99.9% availability, an alert fires when you’re burning through your error budget faster than sustainable. CPU spikes that don’t affect users? No page. Brief latency bumps within SLO? No page. Only sustained degradation triggers a human response.
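As a sketch of the idea: burn rate is the observed error rate divided by the error budget, and the common multiwindow pattern pages only when both a long and a short window are burning fast. The 14.4 threshold follows the widely used SRE-workbook convention and is an assumption, not a value from this post:

```python
def burn_rate(errors, requests, slo=0.999):
    """Observed error rate divided by the error budget. 1.0 means
    you'd spend exactly the full budget over the SLO window; higher
    means you're on track to blow it early."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo  # 0.001 for a 99.9% SLO
    return (errors / requests) / budget

def should_page(long_window, short_window, threshold=14.4):
    """Multiwindow rule: each window is an (errors, requests) tuple.
    The short window confirms the problem is still happening, so a
    brief blip that stays within SLO never pages anyone."""
    return (burn_rate(*long_window) > threshold
            and burn_rate(*short_window) > threshold)
```

A CPU spike that never produces user-visible errors scores a burn rate of zero, which is exactly the point: the pager tracks the SLO, not the machine.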
Escalation Policies
A clear escalation path prevents the on-call engineer from being a single point of failure. It’s not punishment — it’s support.
| Time | What Happens |
|---|---|
| 0 min | Primary on-call paged |
| 5 min (no ack) | Secondary on-call paged |
| 10 min (no ack) | Team lead paged |
| 15 min (acked but unresolved) | On-call decides: handle solo or escalate for help |
| 30 min (unresolved SEV 1) | Engineering manager paged |
| 60 min (unresolved SEV 1) | Senior leadership notified |
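The unacknowledged path in that table is just data plus a timer, which is how paging tools model it. A minimal sketch where the tier names are placeholders, not anyone's real config:

```python
# Minutes-elapsed thresholds for an unacknowledged page, mirroring
# the table above. Real tools (PagerDuty, Opsgenie) express the same
# idea as configuration rather than code.
UNACKED_ESCALATION = [
    (0, "primary"),
    (5, "secondary"),
    (10, "team_lead"),
]

def who_is_paged(minutes_elapsed, acknowledged):
    """Return every tier paged so far. Once the page is acknowledged,
    further escalation is a human decision, not a timer."""
    if acknowledged:
        return ["primary"]
    return [role for t, role in UNACKED_ESCALATION if minutes_elapsed >= t]
```

Note the deliberate break at acknowledgment: timers get a human looking at the problem, and after that, escalating for help is a choice the on-call makes, not something the system forces.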
When I get paged as an escalation, my job is to help, not to judge. The goal is resolution, not blame.
Protecting Personal Time
On-call only works long-term if engineers can actually live their lives during the rotation. These aren't perks; they're the minimum for sustainability:

- Shift swaps that take one message, not a negotiation
- Real compensation for carrying the pager, in pay or time off
- Reduced project expectations during on-call weeks
- A clean handoff at the end of the shift, so off-call actually means off
Track on-call burden per engineer per quarter. If someone’s after-hours page count is more than 2x the team average, investigate the specific alerts causing it. Either fix the underlying reliability issue or rebalance the rotation. On-call fairness is a retention issue dressed up as an operational one.
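The 2x check is easy to automate. A sketch, assuming you can pull a quarter of after-hours pages as a list of engineer names; note that engineers with zero pages won't appear in that list, so a real version should include the full roster when computing the average:

```python
from collections import Counter

def flag_overloaded(after_hours_pages, factor=2.0):
    """after_hours_pages: one engineer name per after-hours page this
    quarter. Returns {engineer: count} for anyone paged more than
    `factor` times the team average.

    Caveat: the average here only covers engineers who were paged at
    least once; pass zero-count roster entries for a fair baseline.
    """
    counts = Counter(after_hours_pages)
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)
    return {eng: n for eng, n in counts.items() if n > factor * average}
```

Anyone this flags is your pointer to the next reliability fix, not a scheduling problem to shuffle away.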
On-Call Retrospectives
After every rotation, the outgoing engineer writes a brief retro covering: pages received (count, severity, time of day), false positives, runbook gaps, and improvement suggestions. These retros feed into the quarterly alert audit and surface problems continuously — instead of waiting for someone to burn out and quit before you notice.
The best on-call systems are invisible to the engineers in them. The pager rarely fires. When it does, the runbook is clear. Escalation is easy. The experience is manageable. That’s never an accident — it’s the result of deliberate design, and it’s worth every minute you invest in getting it right.