SLI Recipes
Indicators are the ingredients of a good SLO. Here are the ones we rely on most.
1. Latency
- Definition: time from request receipt to successful response (p50/p90/p95).
- Implementation: OpenTelemetry spans with http.server.duration, aggregated in Prometheus.
- Dashboard: contrast vs Core Web Vitals (frontend) and API calls.
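To make the percentile definition concrete, here is a minimal sketch that computes p50/p90/p95 from raw request durations with the nearest-rank method. The function names and sample values are hypothetical. In Prometheus the same figures would normally be estimated from histogram buckets with histogram_quantile, which interpolates, so its results can differ slightly from this exact calculation.

```python
def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    # Integer ceiling of pct/100 * n, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def latency_sli(samples_ms):
    """Return the three percentiles named in the latency definition."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 90, 95)}


# Hypothetical request durations in milliseconds.
durations_ms = [120, 80, 95, 300, 110, 105, 90, 1000, 130, 100]
print(latency_sli(durations_ms))  # {'p50': 105, 'p90': 300, 'p95': 1000}
```

Note how one slow request (1000 ms) leaves p50 untouched but sets p95, which is why the tail percentiles belong on the dashboard next to the median.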
2. Reliability / Error Rate
- Definition: (failed requests / total requests) per service + endpoint.
- Implementation: HTTP status buckets, GraphQL errors, workflow job failures.
- Tip: separate user errors from platform errors to avoid noisy metrics.
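A minimal sketch of the error-rate ratio, grouped per service + endpoint and split into user and platform errors. It assumes, for illustration only, that 4xx statuses count as user errors and 5xx as platform errors; the tuple layout and function name are hypothetical, and GraphQL or workflow job failures would need their own classification.

```python
from collections import defaultdict


def error_rates(requests):
    """requests: iterable of (service, endpoint, http_status) tuples.

    Returns per-(service, endpoint) totals and error ratios.
    Assumption: 4xx = user error, 5xx = platform error.
    """
    totals = defaultdict(int)
    user_errors = defaultdict(int)
    platform_errors = defaultdict(int)

    for service, endpoint, status in requests:
        key = (service, endpoint)
        totals[key] += 1
        if 400 <= status < 500:
            user_errors[key] += 1
        elif status >= 500:
            platform_errors[key] += 1

    return {
        key: {
            "total": total,
            "user_error_rate": user_errors[key] / total,
            "platform_error_rate": platform_errors[key] / total,
        }
        for key, total in totals.items()
    }


# Hypothetical request log.
sample = [
    ("api", "/tasks", 200),
    ("api", "/tasks", 500),
    ("api", "/tasks", 404),
    ("api", "/tasks", 200),
    ("api", "/login", 200),
]
rates = error_rates(sample)
print(rates[("api", "/tasks")])
```

Keeping the two ratios separate means an SLO can be written against platform_error_rate alone, so a burst of bad client requests does not burn the error budget.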
3. Freshness / Data Lag
- Definition: time between source event and availability in consumer system.
- Use case: Productivity dashboards fed from automations; Thinki.sh community stats.
- Implementation: event timestamp vs ingest timestamp; alert if > threshold.
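The timestamp comparison can be sketched as below. The five-minute threshold and the function names are placeholders, not values from our alerts; both timestamps are assumed to be timezone-aware so they can be subtracted safely.

```python
from datetime import datetime, timedelta, timezone


def data_lag(event_time, ingest_time):
    """Freshness lag: how long after the source event the record
    became available in the consumer system."""
    return ingest_time - event_time


def is_stale(event_time, ingest_time, threshold):
    """True when the lag exceeds the alerting threshold."""
    return data_lag(event_time, ingest_time) > threshold


# Hypothetical values: event at 12:00:00 UTC, ingested at 12:07:30 UTC.
event_time = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
ingest_time = datetime(2024, 1, 1, 12, 7, 30, tzinfo=timezone.utc)
threshold = timedelta(minutes=5)  # placeholder threshold

print(data_lag(event_time, ingest_time))             # 0:07:30
print(is_stale(event_time, ingest_time, threshold))  # True
```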
4. Business Outcomes
- Definition: completion rate of key workflows (productivity ritual, AI agent task, Thinki.sh challenge).
- Implementation: fire an event to Segment/PostHog; compute the ratio of completed to started.
- Note: great for product-led SLOs.
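As a rough sketch of the completed-vs-started ratio: the event shape below (workflow, name, run_id) is a hypothetical schema for illustration, not the Segment or PostHog payload format. Runs are matched by run_id so that a duplicate "completed" event cannot push the rate above 1.

```python
def completion_rate(events, workflow):
    """Share of started runs of `workflow` that also completed.

    events: iterable of dicts with hypothetical keys
            'workflow', 'name' ('started' | 'completed'), 'run_id'.
    Returns None when nothing was started, since the ratio is undefined.
    """
    started = {
        e["run_id"] for e in events
        if e["workflow"] == workflow and e["name"] == "started"
    }
    completed = {
        e["run_id"] for e in events
        if e["workflow"] == workflow and e["name"] == "completed"
    }
    if not started:
        return None
    # Only count completions that have a matching start.
    return len(started & completed) / len(started)


# Hypothetical event stream.
events = [
    {"workflow": "productivity_ritual", "name": "started", "run_id": "a"},
    {"workflow": "productivity_ritual", "name": "completed", "run_id": "a"},
    {"workflow": "productivity_ritual", "name": "started", "run_id": "b"},
    {"workflow": "productivity_ritual", "name": "started", "run_id": "c"},
    {"workflow": "productivity_ritual", "name": "completed", "run_id": "c"},
    {"workflow": "productivity_ritual", "name": "started", "run_id": "d"},
    {"workflow": "ai_agent_task", "name": "started", "run_id": "x"},
]
print(completion_rate(events, "productivity_ritual"))  # 0.5
```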
5. Experience Metrics
- Frontend: Core Web Vitals (LCP, INP, CLS) via the web-vitals JS library + Analytics.
- Mobile: App start time, error-free sessions (Sentry), offline success rate.
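Once the vitals land in analytics, each value can be bucketed before it feeds an SLI. The sketch below uses Google's published Core Web Vitals thresholds (LCP 2.5 s / 4 s, INP 200 ms / 500 ms, CLS 0.1 / 0.25); the function name and the choice to pass LCP and INP in milliseconds are assumptions for this example. Google assesses these at the 75th percentile of page loads, so the input would normally be a p75 value rather than a single sample.

```python
# (good_upper_bound, poor_lower_bound) per metric.
# LCP and INP in milliseconds, CLS unitless.
THRESHOLDS = {
    "LCP": (2500, 4000),
    "INP": (200, 500),
    "CLS": (0.1, 0.25),
}


def rate_vital(name, value):
    """Classify a Core Web Vital value as good / needs improvement / poor."""
    good, poor = THRESHOLDS[name]
    if value <= good:
        return "good"
    if value <= poor:
        return "needs improvement"
    return "poor"


# Hypothetical p75 values for one page.
print(rate_vital("LCP", 2400))  # good
print(rate_vital("INP", 350))   # needs improvement
print(rate_vital("CLS", 0.3))   # poor
```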
6. Storage + Queue Health
- Useful for n8n and automation-heavy flows.
- Track backlog size, oldest message age, failure retries.
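The three queue signals can be derived from a snapshot of pending messages, roughly as follows. The Message shape and field names are hypothetical and not tied to n8n or any particular broker; most queue systems expose equivalent values through their own metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class Message:
    enqueued_at: datetime  # timezone-aware enqueue time
    attempts: int          # delivery attempts so far; 1 means no retry yet


def queue_health(messages, now):
    """Summarise backlog size, oldest message age, and retried messages."""
    if not messages:
        return {"backlog_size": 0, "oldest_message_age_s": 0.0, "retried_messages": 0}
    oldest = min(m.enqueued_at for m in messages)
    return {
        "backlog_size": len(messages),
        "oldest_message_age_s": (now - oldest).total_seconds(),
        "retried_messages": sum(1 for m in messages if m.attempts > 1),
    }


# Hypothetical snapshot taken at 12:10:00 UTC.
now = datetime(2024, 1, 1, 12, 10, 0, tzinfo=timezone.utc)
backlog = [
    Message(datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc), attempts=3),
    Message(datetime(2024, 1, 1, 12, 8, 0, tzinfo=timezone.utc), attempts=1),
    Message(datetime(2024, 1, 1, 12, 9, 30, tzinfo=timezone.utc), attempts=2),
]
print(queue_health(backlog, now))
```

Oldest message age is often the most useful of the three for alerting: a small backlog with one very old message usually points to a stuck consumer rather than load.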
Document each SLI in /runbooks/sli/<service>.mdx with query links so anyone can trace data lineage.