SLI Recipes

Indicators are the ingredients of a good SLO. Here are the ones we rely on most.

1. Latency

  • Definition: time from request receipt to successful response (p50/p90/p95).
  • Implementation: OpenTelemetry instrumentation emitting the http.server.duration histogram (named http.server.request.duration in newer semantic conventions), aggregated in Prometheus.
  • Dashboard: plot frontend Core Web Vitals next to API call latency so the two can be compared.
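
To make the p50/p90/p95 definition concrete, here is a minimal sketch that computes nearest-rank percentiles from raw request durations. The function names and the millisecond unit are assumptions for illustration. Prometheus itself estimates quantiles from histogram buckets, so its numbers will differ slightly from this exact-sample version.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number ranks are not pushed up by float rounding.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(durations_ms):
    """Return the three percentiles named in the latency definition."""
    return {f"p{p}": percentile(durations_ms, p) for p in (50, 90, 95)}
```

Calling latency_summary on the durations collected for one endpoint gives the three values an SLO would typically be written against, for example "p95 below some target".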

2. Reliability / Error Rate

  • Definition: failed requests / total requests, computed per service and endpoint.
  • Implementation: HTTP status buckets, GraphQL errors, workflow job failures.
  • Tip: separate user errors (typically HTTP 4xx) from platform errors (typically HTTP 5xx) so the metric is not dominated by bad client input.
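
The ratio and the user/platform split can be sketched as follows. This assumes HTTP traffic only, treats 4xx as user errors and 5xx as platform errors, and uses a hypothetical (service, endpoint, status) tuple as the input shape. GraphQL errors and workflow job failures would need their own classification.

```python
from collections import defaultdict


def error_rates(requests):
    """Compute error rates per (service, endpoint).

    requests: iterable of (service, endpoint, http_status) tuples (hypothetical shape).
    """
    totals = defaultdict(int)
    platform_errors = defaultdict(int)
    user_errors = defaultdict(int)

    for service, endpoint, status in requests:
        key = (service, endpoint)
        totals[key] += 1
        if 500 <= status <= 599:
            platform_errors[key] += 1  # assumed: 5xx counts against the platform
        elif 400 <= status <= 499:
            user_errors[key] += 1  # assumed: 4xx is caused by the caller

    return {
        key: {
            "platform_error_rate": platform_errors[key] / total,
            "user_error_rate": user_errors[key] / total,
        }
        for key, total in totals.items()
    }
```

Only the platform error rate would normally feed the SLO. The user error rate is still worth charting, since a sudden jump can point to a breaking API change.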

3. Freshness / Data Lag

  • Definition: time between source event and availability in consumer system.
  • Use case: Productivity dashboards fed from automations; Thinki.sh community stats.
  • Implementation: subtract the event timestamp from the ingest timestamp; alert when the lag exceeds a threshold.
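
A minimal sketch of the lag check, assuming both timestamps are timezone-aware datetimes. The function names and the five-minute threshold in the usage line are illustrative, not values from this page.

```python
from datetime import datetime, timedelta, timezone


def data_lag(event_time, ingest_time):
    """Time between the source event and its arrival in the consumer system."""
    return ingest_time - event_time


def is_stale(event_time, ingest_time, threshold):
    """True when the lag exceeds the alert threshold."""
    return data_lag(event_time, ingest_time) > threshold


# Example usage with an assumed five-minute threshold.
event = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ingest = event + timedelta(minutes=7)
should_alert = is_stale(event, ingest, timedelta(minutes=5))
```

Mixing naive and timezone-aware datetimes raises a TypeError on subtraction, which is a useful guard against comparing timestamps recorded in different zones.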

4. Business Outcomes

  • Definition: completion rate of key workflows (productivity ritual, AI agent task, Thinki.sh challenge).
  • Implementation: fire started and completed events to Segment/PostHog; compute completed / started.
  • Note: great for product-led SLOs.
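
The completion ratio can be sketched over an exported event list. The event shape here (workflow, run_id, and name keys) is hypothetical and does not reflect the Segment or PostHog schema.

```python
def completion_rate(events, workflow):
    """Share of started runs of a workflow that also completed.

    events: iterable of dicts with "workflow", "run_id" and "name" keys (hypothetical shape).
    Returns None when no run was started, so callers can tell "no data" from 0%.
    """
    started = set()
    completed = set()
    for event in events:
        if event["workflow"] != workflow:
            continue
        if event["name"] == "started":
            started.add(event["run_id"])
        elif event["name"] == "completed":
            completed.add(event["run_id"])

    if not started:
        return None
    # Count only completions whose start was also observed.
    return len(completed & started) / len(started)
```

Deduplicating by run_id keeps retried or double-fired events from inflating either side of the ratio.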

5. Experience Metrics

  • Frontend: Core Web Vitals (LCP, INP, CLS) via the web-vitals JS library + Analytics.
  • Mobile: App start time, error-free sessions (Sentry), offline success rate.

6. Storage + Queue Health

  • Useful for n8n and automation-heavy flows.
  • Track backlog size, oldest message age, and retry counts for failed messages.
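
The three queue signals can be derived from a snapshot of pending messages. This is a sketch under assumptions: the Message dataclass and its fields are hypothetical, and a real queue or n8n instance would expose these numbers through its own API or metrics.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Message:
    """Hypothetical pending-message record."""
    enqueued_at: datetime
    retries: int = 0


def queue_health(messages, now):
    """Summarize backlog size, oldest message age and total retries for a queue snapshot."""
    if not messages:
        return {
            "backlog_size": 0,
            "oldest_message_age": timedelta(0),
            "total_retries": 0,
        }
    oldest = min(m.enqueued_at for m in messages)
    return {
        "backlog_size": len(messages),
        "oldest_message_age": now - oldest,
        "total_retries": sum(m.retries for m in messages),
    }
```

Oldest message age is often the most useful of the three for alerting, because a small backlog can still hide one message that has been stuck for hours.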

Document SLIs in /runbooks/sli/<service>.mdx with query links so anyone can trace data lineage.