On-call is the tax engineers pay for running production systems. And like any tax, if it’s designed poorly, people resent it, avoid it, and eventually leave. I’ve been on both sides — the engineer getting paged six times in one night for alerts that didn’t matter, and the lead designing a rotation that people don’t dread. The difference is never the services. It’s always the rotation design, the alert quality, and the runbook culture.
If your on-call is burning people out, that’s not a people problem. It’s a design problem. And design problems have solutions.
> “Monitoring is for asking questions. Alerting is for demanding answers. Know the difference.”
> — Charity Majors
Why On-Call Burns People Out
It’s rarely one thing. It’s the compounding effect of several, and understanding them is the first step to fixing them.
| Burnout Factor | What It Looks Like | The Real Cost |
|---|---|---|
| Alert fatigue | 15 pages in a night, 12 are false positives | Engineers stop trusting the pager — then miss the real alert |
| Unclear runbooks | Paged at 3 AM with no idea what to do | Longer incidents, more stress, slower resolution |
| No escalation path | Stuck with no one to call for help | 4-hour solo incidents, isolation, resentment |
| No compensation | On-call is “just part of the job” | Your best engineers leave for companies without pagers |
| Uneven distribution | One person gets paged 3x more than others | Feels unfair because it is unfair — it’s a reliability problem masquerading as a rotation problem |
Every on-call system I’ve improved started by honestly diagnosing which of these factors was doing the most damage.
If your team has fewer than 5 engineers, think carefully about whether that team should own on-call at all. A 4-person rotation means on-call every 4 weeks with no backup. One person on leave breaks the whole system. Consider a shared rotation across teams instead.
Rotation Design
The structure of the rotation matters more than most teams realize. Here’s what works at different team sizes:
| Team Size | Rotation Style | On-Call Frequency | Sustainability |
|---|---|---|---|
| 3-4 | Weekly rotation | Every 3-4 weeks | Risky — one person leaving breaks it |
| 5-7 | Weekly rotation | Every 5-7 weeks | Sweet spot for most teams |
| 8-12 | Weekly rotation + backup | Every 8-12 weeks | Comfortable, room for shadow shifts |
| 12+ | Follow-the-sun | Varies by timezone | No after-hours pages — the gold standard |
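The schedule itself doesn't need to be hand-maintained. A minimal sketch of a weekly rotation with a built-in backup, where the engineer names and start date are placeholders:

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Yield (week_start, primary, backup) for each week of the schedule.

    The backup is the next engineer in the cycle, so everyone serves as
    backup the week before their primary shift and the handoff partner
    is always known in advance.
    """
    n = len(engineers)
    for w in range(weeks):
        yield (
            start + timedelta(weeks=w),
            engineers[w % n],        # primary
            engineers[(w + 1) % n],  # backup
        )
```

With five engineers, each person is primary every five weeks, matching the 5-7 row above, and the backup slot doubles as the shadow shift for the following week's primary.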
Handoff rituals matter. The transition between on-call shifts should be a short sync, not a Slack message. Cover these four things in 15 minutes:

- Open incidents and anything being actively watched
- Recent deploys or changes that might page in the coming week
- Alerts that were noisy or flaky during the shift
- Upcoming maintenance windows and known risks
It feels like overhead until the first time it prevents an incident.
Writing Good Runbooks: The 3 AM Test
The bar for a good runbook is simple: can a sleep-deprived engineer who didn’t build this service follow it successfully? If the answer is no, rewrite it.
Runbook quality checklist:

- Passes the 3 AM test: an engineer who didn't build the service can follow it while sleep-deprived
- Leads with the fastest safe mitigation, then the deeper diagnosis steps
- Contains real commands to copy-paste, not descriptions of commands
- Says when to stop and escalate if the steps don't resolve the issue
Include actual commands, actual links, and actual dashboard URLs. “Check the metrics” is useless. “Open [this specific dashboard URL] and look at the ‘Connection Pool’ panel in the top-right” saves minutes, and minutes matter at 3 AM.
Alert Quality Over Quantity
The single most impactful change you can make to an on-call system is reducing alert volume — not by suppressing alerts, but by making every alert meaningful.
Run a quarterly alert audit. For each alert that fired in the last 90 days, ask:
| Question | If “No”… |
|---|---|
| Did it require human action? | Automate the response or remove the alert |
| Was the right action obvious from the alert? | Improve the runbook |
| Did it genuinely need to page someone after hours? | Downgrade to business-hours-only |
| Was it a true positive? | If false positive rate > 10%, fix the threshold or delete it |
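The last question lends itself to automation. A minimal sketch, assuming you can export 90 days of firings from your paging tool as (alert name, was-true-positive) pairs — that export format is an assumption, but most tools can produce something like it:

```python
from collections import defaultdict

FALSE_POSITIVE_LIMIT = 0.10  # the >10% rule from the audit table

def audit(firings):
    """firings: iterable of (alert_name, was_true_positive) pairs
    covering the last 90 days. Returns [(alert_name, fp_rate), ...]
    for every alert over the limit, worst offenders first."""
    total = defaultdict(int)
    false_pos = defaultdict(int)
    for name, true_positive in firings:
        total[name] += 1
        if not true_positive:
            false_pos[name] += 1
    rates = {name: false_pos[name] / total[name] for name in total}
    flagged = [(n, r) for n, r in rates.items() if r > FALSE_POSITIVE_LIMIT]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Run it quarterly and walk the flagged list in the audit meeting: each entry either gets a better threshold or gets deleted.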
On a project I worked on, our first alert audit reduced pages from 45 per week to 12. The 33 we removed were false positives, informational alerts that should have been dashboard metrics, or alerts for conditions that self-healed within minutes.
Consider SLO-based alerting. Instead of alerting on individual symptoms (“CPU > 80%”, “memory > 90%”), alert on error budget burn rate. If your SLO is 99.9% availability, an alert fires when you’re burning through your error budget faster than sustainable. CPU spikes that don’t affect users? No page. Brief latency bumps within SLO? No page. Only sustained degradation triggers a human response.
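As a sketch of the idea: burn rate is the observed error rate divided by the error budget, and the common multiwindow pattern pages only when both a long and a short window are burning fast. The 14.4 threshold follows the widely used SRE-workbook convention and is an assumption, not a value from this post:

```python
def burn_rate(errors, requests, slo=0.999):
    """Observed error rate divided by the error budget. 1.0 means
    you'd spend exactly the full budget over the SLO window; higher
    means you're on track to blow it early."""
    if requests == 0:
        return 0.0
    budget = 1.0 - slo  # 0.001 for a 99.9% SLO
    return (errors / requests) / budget

def should_page(long_window, short_window, threshold=14.4):
    """Multiwindow rule: each window is an (errors, requests) tuple.
    The short window confirms the problem is still happening, so a
    brief blip that stays within SLO never pages anyone."""
    return (burn_rate(*long_window) > threshold
            and burn_rate(*short_window) > threshold)
```

A CPU spike that never produces user-visible errors scores a burn rate of zero, which is exactly the point: the pager tracks the SLO, not the machine.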
Escalation Policies
A clear escalation path prevents the on-call engineer from being a single point of failure. It’s not punishment — it’s support.
| Time | What Happens |
|---|---|
| 0 min | Primary on-call paged |
| 5 min (no ack) | Secondary on-call paged |
| 10 min (no ack) | Team lead paged |
| 15 min (acked but unresolved) | On-call decides: handle solo or escalate for help |
| 30 min (unresolved SEV 1) | Engineering manager paged |
| 60 min (unresolved SEV 1) | Senior leadership notified |
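The unacknowledged path in that table is just data plus a timer, which is how paging tools model it. A minimal sketch where the tier names are placeholders, not anyone's real config:

```python
# Minutes-elapsed thresholds for an unacknowledged page, mirroring
# the table above. Real tools (PagerDuty, Opsgenie) express the same
# idea as configuration rather than code.
UNACKED_ESCALATION = [
    (0, "primary"),
    (5, "secondary"),
    (10, "team_lead"),
]

def who_is_paged(minutes_elapsed, acknowledged):
    """Return every tier paged so far. Once the page is acknowledged,
    further escalation is a human decision, not a timer."""
    if acknowledged:
        return ["primary"]
    return [role for t, role in UNACKED_ESCALATION if minutes_elapsed >= t]
```

Note the deliberate break at acknowledgment: timers get a human looking at the problem, and after that, escalating for help is a choice the on-call makes, not something the system forces.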
When I get paged as an escalation, my job is to help, not to judge. The goal is resolution, not blame.
Protecting Personal Time
On-call only works long-term if engineers can actually live their lives during the rotation. These aren't perks; they're the minimum for sustainability:

- Shift swaps that take one message, not a negotiation
- Real compensation for carrying the pager, in pay or time off
- Reduced project expectations during on-call weeks
- A clean handoff at the end of the shift, so off-call actually means off
Track on-call burden per engineer per quarter. If someone’s after-hours page count is more than 2x the team average, investigate the specific alerts causing it. Either fix the underlying reliability issue or rebalance the rotation. On-call fairness is a retention issue dressed up as an operational one.
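The 2x check is easy to automate. A sketch, assuming you can pull a quarter of after-hours pages as a list of engineer names; note that engineers with zero pages won't appear in that list, so a real version should include the full roster when computing the average:

```python
from collections import Counter

def flag_overloaded(after_hours_pages, factor=2.0):
    """after_hours_pages: one engineer name per after-hours page this
    quarter. Returns {engineer: count} for anyone paged more than
    `factor` times the team average.

    Caveat: the average here only covers engineers who were paged at
    least once; pass zero-count roster entries for a fair baseline.
    """
    counts = Counter(after_hours_pages)
    if not counts:
        return {}
    average = sum(counts.values()) / len(counts)
    return {eng: n for eng, n in counts.items() if n > factor * average}
```

Anyone this flags is your pointer to the next reliability fix, not a scheduling problem to shuffle away.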
On-Call Retrospectives
After every rotation, the outgoing engineer writes a brief retro covering: pages received (count, severity, time of day), false positives, runbook gaps, and improvement suggestions. These retros feed into the quarterly alert audit and surface problems continuously — instead of waiting for someone to burn out and quit before you notice.
The best on-call systems are invisible to the engineers in them. The pager rarely fires. When it does, the runbook is clear. Escalation is easy. The experience is manageable. That’s never an accident — it’s the result of deliberate design, and it’s worth every minute you invest in getting it right.