The Purpose of an SLO
An SLO is not primarily about measuring reliability. It’s about enabling a conversation between product and engineering about how reliable is reliable enough. Without SLOs, the conversation stalls:
- “We should be more reliable” — but what does that mean?
- “This deploy is risky” — but risky compared to what?
- “We’re spending too much time on reliability” — says who?
With SLOs, the same conversations become concrete:
- “We have 35% of our error budget remaining this month.”
- “This deploy historically adds 0.5% error rate. We can afford 2% before we breach the objective.”
- “We’ve breached the SLO twice this quarter — reliability work is justified.”
The Process: From Customer Journey to Alert
Step 1: Start With the User, Not the System
The most common mistake: starting with “what can we easily measure?” instead of “what does the user actually experience?” Start here:
| Customer journey | What success looks like |
|---|---|
| User signs in | Login completes in < 2 seconds |
| User runs a workflow | Workflow completes without error |
| AI agent processes a request | Response within 5 seconds, correct output |
| User uploads a document | Upload succeeds, processing completes within 30 seconds |
Step 2: Choose Your SLI
The indicator is the measurable proxy for user experience. A latency SLI, for example, counts the fraction of requests served faster than a threshold.
Step 3: Set the Objective
The format: an SLI, a target, and a window. For example:
- p95 API latency < 500ms for 99% of requests over 28 days
- Error rate < 0.5% over 7-day rolling window
- Workflow completion rate > 98% over 28 days
- AI response quality rating > 4.0/5.0 over 7-day rolling average
Choose the window to match how the product moves:
| Window | Use when |
|---|---|
| 7-day rolling | Fast-moving products, frequent deploys |
| 28-day rolling | Stable products, monthly business cycles |
| Calendar month | Financial reporting alignment |
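These SLIs can be computed directly from request records. A minimal sketch in Python, with a hypothetical data shape (the `Request` record and sample window are mine, not from any particular system):

```python
from dataclasses import dataclass

@dataclass
class Request:
    latency_ms: float
    ok: bool  # True unless the request returned a server error

def latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests served faster than the latency threshold."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)

def error_rate(requests):
    """Fraction of requests that failed."""
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)

# Hypothetical 100-request window: 97 fast successes, 2 slow successes, 1 failure.
window = [Request(120, True)] * 97 + [Request(900, True)] * 2 + [Request(300, False)]
print(f"latency SLI: {latency_sli(window):.2%}")  # 98 of 100 under 500ms
print(f"error rate:  {error_rate(window):.2%}")   # 1 of 100 failed
```

The same good-events-over-total-events shape applies to every SLI in the table: only the definition of “good” changes.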
Error Budget: The Decision Tool
The error budget is what makes SLOs operational. It answers: “how much can we fail this month?”
| SLO | Monthly budget (minutes) | Weekly budget (minutes) |
|---|---|---|
| 99.9% | 43.2 min | 10.1 min |
| 99.5% | 216 min | 50.4 min |
| 99.0% | 432 min | 100.8 min |
| 95.0% | 2,160 min | 504 min |
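The figures above are plain arithmetic on the objective. A short Python sketch that derives them and applies the deploy check described earlier (the function names and the gate policy are illustrative, not a standard API):

```python
def error_budget_minutes(slo_percent: float, window_minutes: float) -> float:
    """Minutes of allowed unavailability for a given SLO over a window."""
    return (1 - slo_percent / 100.0) * window_minutes

MONTH = 30 * 24 * 60  # 43,200 minutes
WEEK = 7 * 24 * 60    # 10,080 minutes

print(f"{error_budget_minutes(99.9, MONTH):.1f}")  # 43.2, matching the table
print(f"{error_budget_minutes(99.5, WEEK):.1f}")   # 50.4

def can_deploy(remaining_error_rate: float, deploy_historical_rate: float) -> bool:
    """Gate a deploy on the budget it is expected to burn."""
    return deploy_historical_rate <= remaining_error_rate

# “This deploy historically adds 0.5% error rate. We can afford 2%.”
print(can_deploy(remaining_error_rate=0.02, deploy_historical_rate=0.005))  # True
```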
Error Budget Decision Framework
This is the conversation the SLO enables. Not “should we be reliable?” but “given our remaining budget, which option do we take?”
Templates
The SLO Card
For each service, maintain a one-page SLO card: the service name, the user journey it covers, the SLI definition, the objective and window, current performance, and the error-budget policy.
The Prometheus SLO Config (with Sloth)
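A minimal Sloth spec might look like the sketch below (service name, metric names, labels, and the objective are placeholders; adapt them to your metrics). From this one file, Sloth generates the Prometheus recording rules and multiwindow burn-rate alerting rules:

```yaml
version: "prometheus/v1"
service: "checkout"
labels:
  owner: "platform-team"
slos:
  - name: "requests-availability"
    objective: 99.5
    description: "Availability of checkout HTTP requests."
    sli:
      events:
        # Bad events over total events; {{.window}} is filled in by Sloth.
        error_query: sum(rate(http_requests_total{service="checkout",code=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service="checkout"}[{{.window}}]))
    alerting:
      name: CheckoutHighErrorRate
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```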
Common Mistakes
- Setting aspirational targets, not honest ones. An SLO set at 99.99% when you currently achieve 99.5% will be permanently breached. Start at your current performance, then tighten over time as the system improves.
- Not connecting SLOs to user impact. “p99 database query time < 10ms” isn’t a user SLO — users don’t experience database queries directly. Connect it to user experience: “page load time < 2 seconds at p95.”
- Measuring availability without measuring quality. A feature that responds successfully but gives wrong answers is worse than one that’s temporarily unavailable. Add quality SLIs for AI features.
- Ignoring the error budget. If nobody checks the error budget before deploying, the SLO is decoration. Add “check error budget” to your deployment checklist.
- Annual reviews. Review SLOs quarterly at minimum. User expectations change, the product evolves, and an SLO that was ambitious a year ago might be embarrassingly low now.
The Monthly SLO Review
30 minutes, once a month, with the engineering lead and product manager:
- Status: Which SLOs are we meeting? Which did we breach?
- Budget: How much error budget did we consume? On what?
- Trends: Are we getting more or less reliable over time?
- Target review: Are current targets still the right targets?
- Action items: What reliability investments are justified by the data?
