On-call is the tax engineers pay for running production systems. And like any tax, if it’s designed poorly, people resent it, avoid it, and eventually leave. I’ve been on both sides: the engineer getting paged six times in one night for alerts that didn’t matter, and the lead designing a rotation that people don’t dread. The difference is never the services. It’s always the rotation design, the alert quality, and the runbook culture.

If your on-call is burning people out, that’s not a people problem. It’s a design problem. And design problems have solutions.
“Monitoring is for asking questions. Alerting is for demanding answers. Know the difference.” — Charity Majors

Why On-Call Burns People Out

It’s rarely one thing. It’s the compounding effect of several, and understanding them is the first step to fixing them.
| Burnout Factor | What It Looks Like | The Real Cost |
| --- | --- | --- |
| Alert fatigue | 15 pages in a night, 12 of them false positives | Engineers stop trusting the pager, then miss the real alert |
| Unclear runbooks | Paged at 3 AM with no idea what to do | Longer incidents, more stress, slower resolution |
| No escalation path | Stuck with no one to call for help | 4-hour solo incidents, isolation, resentment |
| No compensation | On-call is “just part of the job” | Your best engineers leave for companies without pagers |
| Uneven distribution | One person gets paged 3x more than others | Feels unfair because it is unfair; a reliability problem masquerading as a rotation problem |
Every on-call system I’ve improved started by honestly diagnosing which of these factors was doing the most damage.
If your team has fewer than 5 engineers, think carefully about whether that team should own on-call at all. A 4-person rotation means on-call every 4 weeks with no backup. One person on leave breaks the whole system. Consider a shared rotation across teams instead.

Rotation Design

The structure of the rotation matters more than most teams realize. Here’s what works at different team sizes:
| Team Size | Rotation Style | On-Call Frequency | Sustainability |
| --- | --- | --- | --- |
| 3–4 | Weekly rotation | Every 3–4 weeks | Risky; one person leaving breaks it |
| 5–7 | Weekly rotation | Every 5–7 weeks | Sweet spot for most teams |
| 8–12 | Weekly rotation + backup | Every 8–12 weeks | Comfortable, room for shadow shifts |
| 12+ | Follow-the-sun | Varies by timezone | No after-hours pages; the gold standard |
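A weekly rotation with a trailing backup can be generated deterministically from a start date instead of being maintained by hand. A minimal sketch; the team names, the epoch date, and the choice of “backup = last week’s primary” are illustrative assumptions, not a prescription:

```python
from datetime import date

def on_call_for_week(engineers: list[str], week_start: date,
                     epoch: date) -> tuple[str, str]:
    """Return (primary, backup) for the week containing week_start.

    Backup is last week's primary, so incident context carries
    over across the handoff.
    """
    weeks_elapsed = (week_start - epoch).days // 7
    primary = engineers[weeks_elapsed % len(engineers)]
    backup = engineers[(weeks_elapsed - 1) % len(engineers)]
    return primary, backup

# Hypothetical 6-person team; epoch is the Monday the rotation started.
team = ["ana", "ben", "chi", "dev", "eli", "fay"]
print(on_call_for_week(team, date(2024, 1, 15), epoch=date(2024, 1, 1)))
# -> ('chi', 'ben')
```

Because the schedule is a pure function of the date, anyone can compute who is on-call months ahead, which makes swaps and leave planning easy to reason about.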
Handoff rituals matter. The transition between on-call shifts should be a short sync, not a Slack message. Cover these four things in 15 minutes:
  • Active incidents or concerns — anything simmering that might escalate
  • Recent deploys — what shipped this week that might cause issues
  • Alert noise — known false positives or noisy alerts being worked on
  • Personal context — “I have an appointment Wednesday morning” so the backup knows when to be attentive
It feels like overhead until the first time it prevents an incident.

Writing Good Runbooks: The 3 AM Test

The bar for a good runbook is simple: can a sleep-deprived engineer who didn’t build this service follow it successfully? If the answer is no, rewrite it.

Runbook quality checklist:
  • Explains what the alert means in plain language (not just “CPU is high” but why it matters and what it correlates with)
  • Links to the specific dashboard — not “check Grafana” but the actual URL with the relevant panel
  • Provides a decision tree — “if X, do Y; if Z, do W” — so the on-call doesn’t have to diagnose from scratch
  • Includes exact commands to run, not vague instructions
  • Has a clear escalation path with names or team handles
  • Lists related runbooks for adjacent failure modes
  • Shows a “last updated” date so you know if it’s stale
  • Has been tested by someone who didn’t write it
Include actual commands, actual links, and actual dashboard URLs. “Check the metrics” is useless. “Open [this specific dashboard URL] and look at the ‘Connection Pool’ panel in the top-right” saves minutes, and minutes matter at 3 AM.

Alert Quality Over Quantity

The single most impactful change you can make to an on-call system is reducing alert volume — not by suppressing alerts, but by making every alert meaningful. Run a quarterly alert audit. For each alert that fired in the last 90 days, ask:
| Question | If “No”… |
| --- | --- |
| Did it require human action? | Automate the response or remove the alert |
| Was the right action obvious from the alert? | Improve the runbook |
| Did it genuinely need to page someone after hours? | Downgrade to business-hours-only |
| Was it a true positive? | If the false positive rate is above 10%, fix the threshold or delete it |
On a project I worked on, our first alert audit reduced pages from 45 per week to 12. The 33 we removed were false positives, informational alerts that should have been dashboard metrics, or alerts for conditions that self-healed within minutes.

Consider SLO-based alerting. Instead of alerting on individual symptoms (“CPU > 80%”, “memory > 90%”), alert on error budget burn rate. If your SLO is 99.9% availability, an alert fires when you’re burning through your error budget faster than sustainable. CPU spikes that don’t affect users? No page. Brief latency bumps within SLO? No page. Only sustained degradation triggers a human response.
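The burn-rate check itself fits in a few lines. A sketch: 14.4 is the commonly cited fast-burn threshold for a 30-day 99.9% SLO (from the Google SRE workbook); the threshold, window sizes, and error rates below are illustrative assumptions:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    # Error budget is the allowed failure fraction (0.001 for 99.9%).
    # Burn rate 1.0 means the budget lasts exactly the SLO window.
    return error_rate / (1.0 - slo)

def should_page(short_window_rate: float, long_window_rate: float,
                slo: float = 0.999, threshold: float = 14.4) -> bool:
    # Multiwindow rule: page only when BOTH windows burn fast.
    # The long window filters one-off blips; the short window
    # confirms the problem is still happening right now.
    return (burn_rate(short_window_rate, slo) >= threshold
            and burn_rate(long_window_rate, slo) >= threshold)

# Sustained 2% error rate: both windows hot -> page.
print(should_page(0.02, 0.02))    # True
# Spike that already recovered: short window cold -> no page.
print(should_page(0.0002, 0.02))  # False
```

The second case is exactly the “self-healed within minutes” category from the audit above: the long window still remembers the spike, but nobody gets woken up for it.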

Escalation Policies

A clear escalation path prevents the on-call engineer from being a single point of failure. It’s not punishment — it’s support.
| Time | What Happens |
| --- | --- |
| 0 min | Primary on-call paged |
| 5 min (no ack) | Secondary on-call paged |
| 10 min (no ack) | Team lead paged |
| 15 min (acked but unresolved) | On-call decides: handle solo or escalate for help |
| 30 min (unresolved SEV 1) | Engineering manager paged |
| 60 min (unresolved SEV 1) | Senior leadership notified |
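A policy like this is easy to encode as data, which keeps it reviewable in version control alongside the services it protects. A simplified sketch: it models the no-ack path plus the SEV 1 rows as purely time-based, and deliberately omits the 15-minute judgment call, which needs a human:

```python
# (minutes elapsed, role, sev1_only) -- mirrors the table above.
ESCALATION = [
    (0,  "primary on-call", False),
    (5,  "secondary on-call", False),
    (10, "team lead", False),
    (30, "engineering manager", True),
    (60, "senior leadership", True),
]

def who_to_page(minutes_unresolved: int, sev1: bool = False) -> list[str]:
    """Everyone who should have been engaged by now."""
    return [role for t, role, sev1_only in ESCALATION
            if minutes_unresolved >= t and (sev1 or not sev1_only)]

print(who_to_page(12))
# -> ['primary on-call', 'secondary on-call', 'team lead']
```

Keeping the thresholds in one table means changing the policy is a one-line diff, not a tour through paging-tool settings.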
When I get paged as an escalation, my job is to help, not to judge. The goal is resolution, not blame.

Protecting Personal Time

On-call only works long-term if engineers can actually live their lives during the rotation. These aren’t perks — they’re the minimum for sustainability:
  • No on-call during approved leave. Ever. Swap the rotation.
  • Comp time for after-hours pages. Paged between 10 PM and 7 AM? Take a late start the next morning. No questions, no guilt.
  • Weekend pages get compensated. Time-and-a-half or equivalent time off — non-negotiable for retention.
  • On-call load is tracked per engineer. If one person is getting paged disproportionately, the service has a reliability problem to fix.
  • Quiet hours are respected. If someone isn’t on-call, they should never feel pressure to respond to alerts in Slack.
Track on-call burden per engineer per quarter. If someone’s after-hours page count is more than 2x the team average, investigate the specific alerts causing it. Either fix the underlying reliability issue or rebalance the rotation. On-call fairness is a retention issue dressed up as an operational one.
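Tracking that burden doesn’t need tooling beyond an export from your paging system. A sketch, assuming a list of (engineer, hour-of-day) page records and the 10 PM–7 AM after-hours window from the comp-time rule above; the names and the 2x factor are illustrative:

```python
from collections import Counter

def is_after_hours(hour: int) -> bool:
    # Matches the comp-time window above: 10 PM to 7 AM.
    return hour >= 22 or hour < 7

def flag_overloaded(pages: list[tuple[str, int]], roster: list[str],
                    factor: float = 2.0) -> list[str]:
    """Engineers whose after-hours page count exceeds factor * team average.

    pages: (engineer, hour_of_day) per page. roster: everyone in the
    rotation, so engineers with zero pages still count toward the average.
    """
    counts = Counter(eng for eng, hour in pages if is_after_hours(hour))
    avg = sum(counts.values()) / len(roster) if roster else 0.0
    return [eng for eng in roster if counts[eng] > factor * avg]

pages = [("ana", 23)] * 6 + [("ben", 2), ("chi", 6), ("chi", 14)]
print(flag_overloaded(pages, roster=["ana", "ben", "chi", "dev"]))
# -> ['ana']
```

The flagged name isn’t the person to blame; it’s a pointer to the service or alert that’s generating the imbalance.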

On-Call Retrospectives

After every rotation, the outgoing engineer writes a brief retro covering: pages received (count, severity, time of day), false positives, runbook gaps, and improvement suggestions. These retros feed into the quarterly alert audit and surface problems continuously — instead of waiting for someone to burn out and quit before you notice. The best on-call systems are invisible to the engineers in them. The pager rarely fires. When it does, the runbook is clear. Escalation is easy. The experience is manageable. That’s never an accident — it’s the result of deliberate design, and it’s worth every minute you invest in getting it right.