A service level indicator is only useful if it correlates with actual user experience. The most common mistake: instrumenting what’s easy to measure rather than what users actually feel. These are the indicators I use across my products, with implementation details and honest notes on where each one falls short.

The Four Categories of SLIs

Every service needs at minimum a latency SLI and an availability SLI. Add quality and business SLIs when you have user-facing features with meaningful quality dimensions (AI outputs, recommendation quality, search relevance).

Latency SLIs

API Latency (p95 / p99)

The most common SLI. Measures how fast your API responds.
// Express middleware to record request duration
import client from 'prom-client';

const httpDuration = new client.Histogram({
  name: 'http_request_duration_ms',
  help: 'HTTP request duration in milliseconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [5, 10, 25, 50, 100, 250, 500, 1000, 2500, 5000, 10000]
});

app.use((req, res, next) => {
  const end = httpDuration.startTimer();

  res.on('finish', () => {
    end({
      method: req.method,
      route: req.route?.path ?? 'unknown',
      status_code: res.statusCode.toString()
    });
  });

  next();
});
Prometheus query for p95 over 28 days:
histogram_quantile(
  0.95,
  sum(rate(http_request_duration_ms_bucket{route="/api/orders"}[28d])) by (le)
)
What to watch for:
  • Measure at the load balancer, not the application — application-level measurement misses network overhead
  • Exclude 4xx errors from latency SLIs — slow clients shouldn’t penalize your SLO
  • Track p50, p95, and p99 separately — high p99 with normal p95 means outlier requests, not broad slowness

Frontend Latency (Core Web Vitals)

For user-facing pages, API latency isn’t what users experience. Core Web Vitals are.
  • LCP (Largest Contentful Paint) — how fast the main content loads. Good: < 2.5s
  • INP (Interaction to Next Paint) — how fast the page responds to clicks. Good: < 200ms
  • CLS (Cumulative Layout Shift) — how much the page jumps around. Good: < 0.1
// web-vitals measurement in Next.js
import { onLCP, onINP, onCLS, type Metric } from 'web-vitals';

function sendToAnalytics({ name, value, id }: Metric) {
  // Send to your observability backend
  analytics.track('web_vital', {
    metric: name,
    value: Math.round(name === 'CLS' ? value * 1000 : value),
    id,
    page: window.location.pathname
  });
}

onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);

Availability SLIs

Request Success Rate

The fundamental availability measurement:
availability = successful_requests / total_requests
What counts as “successful” is the decision you need to make:
  • HTTP 2xx only? (strict)
  • HTTP 2xx + 3xx? (includes redirects)
  • HTTP 2xx + 3xx + 4xx? (4xx are client errors, not your fault)
I use HTTP 5xx as the error signal for my server availability SLI. 4xx errors are client errors — my API responded correctly by returning 4xx.
# Availability over 28-day rolling window
1 - (
  sum(rate(http_requests_total{status_code=~"5.."}[28d]))
  /
  sum(rate(http_requests_total[28d]))
)

Workflow Completion Rate

For products built around workflows (automation, multi-step forms, AI pipelines), availability SLIs on individual endpoints miss the forest for the trees. A user who successfully calls all APIs but gets a broken output has experienced an availability failure.
// Track workflow completion explicitly
async function runWorkflow(workflowId: string, input: WorkflowInput) {
  const startTime = Date.now();

  metrics.increment('workflow.started', { type: workflowId });

  try {
    const result = await executeWorkflow(workflowId, input);

    metrics.increment('workflow.completed', { type: workflowId });
    metrics.histogram('workflow.duration_ms', Date.now() - startTime, { type: workflowId });

    return result;
  } catch (error) {
    metrics.increment('workflow.failed', {
      type: workflowId,
      // `error` is `unknown` in a TS catch clause — narrow before reading its constructor
      error_type: error instanceof Error ? error.constructor.name : 'unknown'
    });
    throw error;
  }
}
SLI formula (in-progress runs are excluded from the denominator so they don't count as failures; with the counters above, in_progress = started − completed − failed):
completion_rate = workflow.completed / (workflow.started - workflow.in_progress)

Quality SLIs

For AI features, a response that succeeds (HTTP 200) but gives a wrong or unhelpful answer is a quality failure. Standard availability metrics miss this entirely.

User-Rated Quality

The most direct quality signal: ask users.
// Simple thumbs up/down on AI responses
async function recordFeedback(
  responseId: string,
  rating: 'positive' | 'negative',
  userId: string
) {
  await db.insert(feedbackTable).values({
    responseId,
    rating,
    userId,
    createdAt: new Date()
  });

  // Track in metrics for SLI calculation
  metrics.increment('ai_response.feedback', {
    rating,
    response_type: await getResponseType(responseId)
  });
}
SLI formula:
quality_score = positive_ratings / total_ratings (over 7-day window)
Target: > 80% positive on most AI features. Below 70% is a product signal, not just a technical one.

LLM-as-Judge Quality

For high-volume AI features where user feedback is sparse, use an LLM to evaluate output quality automatically:
async function evaluateAIOutput(
  query: string,
  response: string,
  context: string
): Promise<{ score: number; reasoning: string }> {
  const evaluation = await claude.messages.create({
    model: 'claude-haiku-4-5-20251001', // Use cheaper model for eval
    max_tokens: 256, // max_tokens is required by the Messages API
    messages: [{
      role: 'user',
      content: `Rate this AI response on relevance (1-5):

Query: ${query}
Context provided: ${context.slice(0, 500)}
Response: ${response}

Return JSON: {"score": N, "reasoning": "..."}`
    }]
  });

  const block = evaluation.content[0];
  if (block.type !== 'text') throw new Error('Unexpected content block type');
  return JSON.parse(block.text);
}
Sample 5-10% of responses for automated evaluation. Alert when the rolling average drops below threshold.

Business Outcome SLIs

The highest-value SLIs are the ones closest to business outcomes. These go beyond “did the system respond?” to “did users achieve what they came to do?”

Conversion / Completion Rate

// Track funnel completion
const funnelSteps = ['view_pricing', 'start_signup', 'complete_signup', 'first_action'];

analytics.track('funnel_step', {
  step: 'complete_signup',
  userId,
  timeToComplete: Date.now() - signupStartTime
});
SLI formula:
conversion_rate = completed_goal / started_goal (over 7-day window)
A drop in conversion rate often indicates a reliability problem before error rates or latency show it — users give up rather than retrying.

Feature Adoption

adoption_rate = users_who_used_feature / total_eligible_users (over 28 days)
Useful for validating that a feature is discoverable and working. Low adoption on a supposedly well-designed feature usually means something is broken in the happy path.

Freshness / Data Lag

For products with near-real-time data (dashboards, analytics, recommendation feeds):
// Measure how fresh your data is
async function measureDataFreshness(source: string): Promise<number> {
  const latestRecord = await db
    .select({ eventTime: dataTable.eventTime, ingestTime: dataTable.ingestTime })
    .from(dataTable)
    .where(eq(dataTable.source, source))
    .orderBy(desc(dataTable.ingestTime))
    .limit(1);

  if (!latestRecord.length) return Infinity;

  const lagMs = latestRecord[0].ingestTime.getTime() - latestRecord[0].eventTime.getTime();

  metrics.gauge('data.lag_ms', lagMs, { source });

  return lagMs;
}
SLI formula:
freshness_slo: data lag < 5 minutes for 99% of records

Infrastructure Health SLIs

These are supporting SLIs — they don’t directly measure user experience but predict problems before they cause user impact.

Database Connection Pool

// Track pool saturation — high saturation predicts availability failures
import { Pool } from 'pg';

const pool = new Pool({ max: 20 });

setInterval(() => {
  metrics.gauge('db.pool.active', pool.totalCount - pool.idleCount);
  metrics.gauge('db.pool.idle', pool.idleCount);
  metrics.gauge('db.pool.waiting', pool.waitingCount);
  metrics.gauge('db.pool.saturation',
    (pool.totalCount - pool.idleCount) / pool.totalCount
  );
}, 10000);
Alert when saturation > 80%. Investigate when saturation > 60% consistently.

Queue Depth

// For BullMQ or similar
import { Queue } from 'bullmq';

const queue = new Queue('email');

setInterval(async () => {
  const counts = await queue.getJobCounts('waiting', 'active', 'failed');

  metrics.gauge('queue.waiting', counts.waiting, { queue: 'email' });
  metrics.gauge('queue.active', counts.active, { queue: 'email' });
  metrics.gauge('queue.failed', counts.failed, { queue: 'email' });

  // Oldest job age — proxy for consumer health
  const oldestJob = await queue.getJobs(['waiting'], 0, 1, true); // ascending: index 0 is the oldest waiting job
  if (oldestJob.length > 0) {
    const ageMs = Date.now() - oldestJob[0].timestamp;
    metrics.gauge('queue.oldest_job_age_ms', ageMs, { queue: 'email' });
  }
}, 30000);
Alert when oldest job age exceeds your processing SLO. A deep queue growing faster than it’s being consumed is a pre-incident signal.

SLI Anti-Patterns

  • Measuring server health, not user experience. “CPU < 70%” is an infrastructure metric, not a user SLI. A server at 69% CPU serving slow responses is failing users while meeting its “SLI.”
  • 100% availability targets. Nothing is 100% available. A 100% SLO means you’re always in breach or you’ve defined the measurement to exclude all real failures.
  • Too many SLIs. Three well-chosen, actively reviewed SLIs are worth more than fifteen SLIs that nobody checks. Start with latency + availability + one quality metric for your most critical user journey.
  • Measuring the happy path only. An SLI that only counts successful requests misses the users who got errors, timed out, and gave up. Include failure modes in your measurement.
  • SLIs with no owners. Every SLI should have someone accountable for it. “The team owns it” means nobody owns it.