SLO Playbook

Service Level Objectives anchor reliability discussions. Here’s how we set them across the Dev, Productivity, and AI pods.

1. Pick the Right SLI First

  • Start with the customer journey (sign-in, run workflow, call agent).
  • Choose indicators: latency p95, error rate, freshness, throughput.
  • Validate that the underlying data is actually available before committing to an indicator.
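
As a rough illustration of the indicators listed above, the sketch below computes a p95 latency and an error rate from raw request records. The `Request` record and the function names are hypothetical, and the nearest-rank percentile is only one of several common definitions; a metrics backend may compute percentiles differently.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical record for one step of a customer journey."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples; validate data availability first")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_rate(requests):
    """Fraction of requests that failed, between 0.0 and 1.0."""
    if not requests:
        raise ValueError("no samples; validate data availability first")
    failed = sum(1 for r in requests if not r.ok)
    return failed / len(requests)
```

Both functions raise on an empty sample set rather than returning a default, so a missing data source surfaces before anyone commits to the indicator.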

2. Craft the Objective

  • Format: “Measure ≤ Target for X% of requests over a rolling window”.
  • Example: “Productivity automations p95 latency ≤ 700 ms for 99% of runs in a 28‑day window”.
  • Tie the objective to a user outcome (fast automations, reliable agents, etc.).
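
One way to make the objective format concrete is to store it as data and check it against the events in the window. This is a minimal sketch with hypothetical names (`Objective`, `compliance_pct`, `is_met`). It assumes a per-event reading of the format, where each run counts as good if its latency is within the threshold; teams that measure an aggregated p95 per interval would count good intervals instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """Hypothetical container for 'Measure <= Target for X% over a rolling window'."""
    name: str
    threshold_ms: float  # the "Target" the measure must stay at or below
    target_pct: float    # the "X%" of events that must be good
    window_days: int     # length of the rolling window


def compliance_pct(latencies_ms, objective):
    """Percentage of events in the window whose latency is within the threshold."""
    if not latencies_ms:
        raise ValueError("no events in the window")
    good = sum(1 for v in latencies_ms if v <= objective.threshold_ms)
    return 100 * good / len(latencies_ms)


def is_met(latencies_ms, objective):
    """True when the share of good events reaches the objective's target."""
    return compliance_pct(latencies_ms, objective) >= objective.target_pct


# Values taken from the example objective above.
automations = Objective(
    name="Productivity automations latency",
    threshold_ms=700,
    target_pct=99,
    window_days=28,
)
```

The caller is expected to pass only events that fall inside the rolling window; the sketch does not filter by timestamp.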

3. Error Budgets & Guardrails

  • Error budget = 1 - SLO (e.g., a 99% SLO leaves a 1% budget).
  • Track burn rates over 1h, 6h, and 24h windows. Alert the responsible roles when a burn rate exceeds policy.
  • Link features/experiments to budget consumption.
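
The budget and burn-rate arithmetic above can be sketched as follows. Burn rate here means the observed error rate divided by the error budget, so a sustained rate of 1.0 spends the budget exactly over the full SLO window. The function names are hypothetical, and the threshold values are illustrative placeholders rather than this playbook's alert policy.

```python
def error_budget(slo_target):
    """Allowed failure fraction for an SLO given as a fraction, e.g. 0.99 -> about 0.01."""
    return 1 - slo_target


def burn_rate(bad_events, total_events, slo_target):
    """Observed error rate divided by the error budget for one lookback window."""
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad_events / total_events) / error_budget(slo_target)


# Placeholder thresholds per lookback window; substitute the team's real policy.
BURN_THRESHOLDS = {"1h": 14.4, "6h": 6.0, "24h": 3.0}


def windows_over_policy(burn_by_window, thresholds=BURN_THRESHOLDS):
    """Return the lookback windows whose burn rate exceeds its threshold."""
    return [w for w, rate in burn_by_window.items() if rate > thresholds[w]]
```

For example, 5 failed runs out of 100 against a 99% SLO gives a burn rate of roughly 5, meaning the budget would be exhausted in about a fifth of the SLO window if that rate held.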

4. Review Cadence

  • Weekly SLO review: status, burn, actions.
  • Monthly: re-evaluate whether each SLO is still meaningful; adjust targets based on data and business goals.
  • Post-incident: decide whether the threshold or indicator needs revision.

5. Tooling

  • Nobl9 / Sloth / custom calculators generate SLO configs for Prometheus.
  • Grafana dashboards with error budget visualizations and deployment annotations.
  • Slack bot posts SLO digests with CTA to relevant runbooks.

6. Templates

### SLO Card
- Service:
- Indicator:
- Target:
- Window:
- Alert policy:
- Owners:
- Links: dashboards, runbooks, repos

Use this playbook when onboarding a new service or revisiting objectives for an existing one.