SLO Playbook
Service Level Objectives anchor reliability discussions. Here’s how we set them across the Dev, Productivity, and AI pods.
1. Pick the Right SLI First
- Start with customer journey (sign-in, run workflow, call agent).
- Choose indicators: latency p95, error rate, freshness, throughput.
- Validate data availability before committing.
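As a rough illustration of the indicators listed above, the sketch below derives a p95 latency and an error rate from raw request samples. The `RequestSample` type, the `compute_slis` helper, and the nearest-rank percentile method are assumptions made for this example; the pods' actual pipelines may compute these differently (for instance, from Prometheus histograms).

```python
import math
from dataclasses import dataclass


@dataclass
class RequestSample:
    """One observed request (hypothetical shape, for illustration only)."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples: check data availability first")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def compute_slis(samples):
    """Return candidate SLIs plus the sample count used to compute them."""
    latencies = [s.latency_ms for s in samples]
    p95 = percentile(latencies, 95)  # raises if there is no data
    errors = sum(1 for s in samples if not s.ok)
    return {
        "latency_p95_ms": p95,
        "error_rate": errors / len(samples),
        "sample_count": len(samples),
    }
```

Returning `sample_count` alongside the indicators makes it easier to spot journeys with too little data to support an SLI before committing to one.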
2. Craft the Objective
- Format: “<SLI> ≤ <target> for <N>% of requests over a rolling <window>”.
- Example: “Productivity automations p95 latency ≤ 700 ms for 99% of runs in a 28‑day window”.
- Tie objective to user outcome (fast automations, reliable agents, etc.).
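One way to keep objectives consistent with the format above is to store them as structured data and render the sentence from it. This is a minimal sketch, not an existing tool: the `Objective` class and its field names are hypothetical, and it assumes a "good event" is a run that met the threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """Hypothetical container for one SLO definition."""
    name: str          # journey or service, e.g. "Productivity automations"
    indicator: str     # e.g. "p95 latency"
    threshold: float   # upper bound for the indicator
    unit: str          # e.g. "ms"
    target: float      # required fraction of good events, e.g. 0.99
    window_days: int   # rolling window length in days

    def describe(self):
        """Render the objective in the playbook's sentence format."""
        return (
            f"{self.name} {self.indicator} <= {self.threshold:g} {self.unit} "
            f"for {self.target * 100:g}% of runs in a {self.window_days}-day window"
        )

    def is_met(self, good_events, total_events):
        """True if the share of good events reaches the target."""
        if total_events <= 0:
            raise ValueError("total_events must be positive")
        return good_events / total_events >= self.target


slo = Objective("Productivity automations", "p95 latency", 700, "ms", 0.99, 28)
print(slo.describe())
```

Keeping the threshold, target, and window as separate fields makes it harder to write an objective that omits one of them.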
3. Error Budgets & Guardrails
- Error budget = 1 - SLO target (e.g., a 99% SLO leaves a 1% budget).
- Track burn rates over 1h, 6h, and 24h windows. Alert the responsible roles when a burn rate exceeds the policy threshold.
- Link features/experiments to budget consumption.
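The budget and burn-rate arithmetic above can be sketched as follows. Burn rate here is the observed error rate divided by the error budget, so a value of 1.0 means the budget would be used up exactly at the end of the window. The function names and the per-window thresholds in `EXAMPLE_POLICY` are illustrative assumptions, not the actual alerting policy.

```python
def error_budget(slo_target):
    """Allowed failure fraction, e.g. 0.99 -> roughly 0.01."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate relative to the error budget."""
    if total_events == 0:
        return 0.0  # no traffic in the window, nothing burned
    return (bad_events / total_events) / error_budget(slo_target)


# Hypothetical thresholds per lookback window; tune these to your own policy.
EXAMPLE_POLICY = {"1h": 14.4, "6h": 6.0, "24h": 3.0}


def breached_windows(observations, slo_target, policy=EXAMPLE_POLICY):
    """Return the windows whose burn rate exceeds the policy threshold.

    observations maps a window name to a (bad_events, total_events) tuple.
    """
    return [
        window
        for window, (bad, total) in observations.items()
        if burn_rate(bad, total, slo_target) > policy[window]
    ]


observed = {"1h": (20, 100), "6h": (30, 1000), "24h": (100, 10000)}
print(breached_windows(observed, slo_target=0.99))
```

Checking several windows together helps separate short, sharp spikes (fast burn on the 1h window) from slow, sustained budget consumption (the 24h window).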
4. Review Cadence
- Weekly SLO review: status, burn, actions.
- Monthly: re-evaluate whether each SLO is still meaningful; adjust targets based on data and business goals.
- Post-incident: decide whether threshold/indicator needs revision.
5. Tooling
- Nobl9 / Sloth / custom calculators generate SLO configs for Prometheus.
- Grafana dashboards with error budget viz + annotation for deployments.
- Slack bot posts SLO digests with CTA to relevant runbooks.
