Debugging Production: A Staff Engineer’s Playbook
The Slack message comes in at 2:47 PM on a Tuesday: “Something is wrong with checkout. Users are reporting errors.” Your heart rate goes up. You open your laptop. What do you do first?
I’ve been on the receiving end of this message dozens of times across Atlassian, Bugcrowd, and Weel. Each time, the instinct is to dive into code. The instinct is wrong. Production debugging is not about code — it’s about information gathering, hypothesis testing, and communication. The code is the last thing you look at.
Here’s the playbook I’ve developed over 15 years.
The First 5 Minutes
The first 5 minutes of a production incident determine whether you resolve it in 30 minutes or 3 hours. Here’s my exact sequence:
Minute 1: Confirm the problem is real. Check monitoring dashboards. Is error rate actually elevated? Is latency spiking? Sometimes “users are reporting errors” means one user saw one error once. Don’t spin up an incident for a single transient error.
Minute 2: Assess blast radius. How many users are affected? All users? Users in a specific region? Users on a specific plan? Is this a complete outage or a degradation? This determines severity and who needs to be notified.
Minute 3: Check what changed. Open your deployment log. Was there a deploy in the last hour? A config change? A feature flag toggle? A third-party service update? The answer to “what changed?” resolves 70% of production issues.
Minute 4: Communicate. Post in the incident channel: what you know, what you don’t know, and what you’re doing next. Even if the answer is “investigating — will update in 15 minutes.” Silence during an incident is worse than bad news.
Minute 5: Decide — rollback or investigate? If a deploy happened recently and the timing correlates, roll back first, investigate later. Don’t spend 45 minutes debugging when a 2-minute rollback would restore service.
The biggest mistake in incident response is continuing to investigate when you should be mitigating. If you can restore service with a rollback, config change, or feature flag toggle, do that FIRST. Debug the root cause after users are happy.
The Triage Framework
Once the immediate response is handled, I use a structured triage framework. The goal is to systematically narrow down possibilities.
Level 1: Is it us or them?
Before debugging your code, confirm the problem is in your system.
- Check third-party status pages. AWS, Stripe, Auth0, Datadog — whatever you depend on. A third-party outage disguised as your bug has wasted countless engineering hours.
- Check infrastructure. Is the database healthy? Is CPU/memory normal? Are there network issues?
- Check DNS and CDN. Sometimes the problem is that CloudFront is serving stale content or a DNS change hasn’t propagated.
# Quick checks
curl -o /dev/null -s -w "%{http_code} %{time_total}s\n" https://yourapp.com/health
# Check if the issue is regional
curl -o /dev/null -s -w "%{http_code}\n" --resolve yourapp.com:443:SPECIFIC_IP https://yourapp.com/health
Level 2: Where in the stack?
If it’s your system, narrow down the layer:
| Layer | Signals |
|---|---|
| CDN/Edge | Stale content, wrong headers, 403s |
| Load balancer | 502/503 errors, uneven distribution |
| Application | Error logs, slow queries, exceptions |
| Database | Connection pool exhaustion, slow queries, locks |
| External service | Timeout errors, elevated latency to specific endpoints |
| Client-side | JavaScript errors, rendering failures, network errors |
Level 3: When did it start?
Correlate the start time with events:
- Deployment timestamps
- Config changes
- Feature flag toggles
- Cron job executions
- Traffic spikes
- Third-party API changes
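This correlation step can be sketched in code: given the incident start time and a feed of change events, return everything that landed in the preceding window. The `ChangeEvent` shape and the 30-minute window are my own assumptions for illustration, not a standard:

```typescript
// Sketch: rank change events that landed shortly before an incident.
// The ChangeEvent shape and the 30-minute window are assumptions.
interface ChangeEvent {
  kind: 'deploy' | 'config' | 'flag' | 'cron' | 'third-party';
  description: string;
  at: Date;
}

function suspects(
  incidentStart: Date,
  events: ChangeEvent[],
  windowMinutes = 30,
): ChangeEvent[] {
  const windowStartMs = incidentStart.getTime() - windowMinutes * 60_000;
  return events
    .filter(e => e.at.getTime() >= windowStartMs && e.at.getTime() <= incidentStart.getTime())
    .sort((a, b) => b.at.getTime() - a.at.getTime()); // most recent change first
}
```

Fed from an export of the change feed, a helper like this turns "scroll back through the channel" into a ranked suspect list.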
I keep a #deployments channel where every deploy, config change, and flag toggle is automatically posted. During an incident, scrolling back through this channel is the fastest way to find the trigger.
Reading Logs Like a Detective
Logs are your primary evidence. But reading logs is a skill — you need to know what to look for and how to filter out noise.
Start with the error, work backward
Don’t start from the beginning of the request. Start from the error and trace backward:
# Find the error
ERROR [2026-02-26T14:47:23Z] PaymentService: Charge failed for user_8a3f — Stripe returned 402
# Now find the request that caused it
grep "user_8a3f" logs | grep "14:47"
# Trace the full request lifecycle
grep "req_id=abc123" logs | sort
Look for patterns, not individual errors
A single error is noise. A pattern is signal.
# Count errors by type in the last hour
grep "ERROR" logs | grep "14:" |
awk '{print $4}' | sort | uniq -c | sort -rn
# Output:
# 847 PaymentService: Charge failed
# 12 AuthService: Token expired
# 3 EmailService: Send failed
847 payment failures is a pattern. 12 token expirations is normal churn. 3 email failures is noise. Focus on the 847.
Correlation is not causation
The most dangerous trap in log analysis: assuming that because two things happen at the same time, one caused the other. A memory spike at 14:45 and payment errors at 14:47 might be related. Or the memory spike might be from a cron job that runs every hour and has nothing to do with payments.
Verify causation by:
- Finding the causal chain in the logs (A led to B led to C)
- Reproducing the issue in a staging environment
- Checking if the supposed cause has happened before without the effect
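The third check is mechanical enough to script. A minimal sketch, using timestamps only — the 5-minute lag window is an assumption, not a rule:

```typescript
// Sketch: find occurrences of the supposed cause that were NOT followed
// by the effect within a lag window. Many such occurrences suggest the
// co-occurrence during the incident is coincidence, not causation.
function causeWithoutEffect(
  causeTimes: Date[],
  effectTimes: Date[],
  maxLagMinutes = 5,
): Date[] {
  const lagMs = maxLagMinutes * 60_000;
  return causeTimes.filter(cause =>
    !effectTimes.some(effect => {
      const delta = effect.getTime() - cause.getTime();
      return delta >= 0 && delta <= lagMs;
    }),
  );
}
```

If the hourly cron's memory spike shows up twenty times in the logs with no payment errors following it, the 14:45 co-occurrence was probably coincidence.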
Reproducing Issues
Reproducing a production bug in a controlled environment is half the battle. Once you can reproduce it, you can debug it.
The reproduction hierarchy
- Unit test — Fastest feedback loop. Can you write a test that fails with the same error?
- Local environment — Can you hit the same code path locally with the same inputs?
- Staging environment — Can you reproduce with production-like data and config?
- Production (safely) — Can you trigger the issue with a test account in production?
When you can’t reproduce
Some bugs only manifest under production conditions — specific data, specific load, specific timing. For these:
- Add logging. Deploy additional structured logging around the suspected code path. Make the logs temporary and behind a flag.
- Use feature flags for debugging. Enable verbose logging for a specific user or a percentage of traffic.
- Check for race conditions. If the bug is intermittent, it’s often a race condition. Think about what concurrent operations could interfere with each other.
// Temporary debug logging behind a flag
async function processPayment(userId: string, amount: number) {
  const debugMode = await getFeatureFlag('debug-payments', userId);
  if (debugMode) {
    logger.info('Payment debug', {
      userId,
      amount,
      timestamp: Date.now(),
      stripeConfig: getStripeConfig(),
      userState: await getUserPaymentState(userId),
    });
  }
  // ... normal payment logic
}
Structured logging is worth 10x unstructured logging during an incident. If your logs are JSON with consistent fields (userId, requestId, traceId, service, duration), you can query them. If they’re unstructured strings, you’re grep-ing blind.
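The "queryable" claim is concrete: each structured line parses into an object you can filter on fields instead of grepping substrings. A sketch, with hypothetical field names:

```typescript
// Sketch: why structured (JSON-per-line) logs beat unstructured strings.
// Each line parses to an object, so filtering is a predicate on fields,
// not a fragile grep pattern. The LogLine fields are assumptions.
interface LogLine {
  level: string;
  service: string;
  userId?: string;
  message: string;
}

function queryLogs(raw: string, predicate: (l: LogLine) => boolean): LogLine[] {
  return raw
    .split('\n')
    .filter(Boolean)
    .map(line => JSON.parse(line) as LogLine)
    .filter(predicate);
}
```

The same predicate works whether you're filtering by level, service, user, or any combination — something a grep pipeline can only approximate.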
Bisecting Deployments
When you know the issue started after a deploy but before the next one, and the deploy contains 20 PRs, you need to bisect.
The manual approach
- Check the deploy’s PR list
- Group by risk: database migrations, API changes, and third-party integrations are higher risk than UI tweaks
- Read the high-risk PRs first — usually 3-5 of the 20 are worth investigating
- Check if any PR touched the code path in the error logs
The automated approach
If your CI supports it, deploy individual commits to staging and run the reproduction steps:
# List commits in the deploy
git log --oneline v2.45.0..v2.46.0
# Binary search for the breaking commit
git bisect start
git bisect bad v2.46.0
git bisect good v2.45.0
# Deploy each commit git bisect suggests, test, mark good/bad
git bisect good # or git bisect bad
Git bisect in production debugging is underrated. It’s especially effective when combined with automated smoke tests that can verify the reproduction case.
The “What Changed?” Checklist
I keep this checklist bookmarked. During any production issue, I run through it:
- Code deploys (application and infrastructure)
- Config changes
- Feature flag toggles
- Database migrations or data changes
- Cron jobs and scheduled tasks
- Third-party API or dependency changes
- DNS or CDN changes
- Traffic patterns (spikes, unusual clients)
Nine times out of ten, the answer to “what changed?” is the answer to “what broke?”
Communication During Incidents
Technical debugging is half the job. The other half is communication. Here’s the communication framework I use:
The incident channel template
Post updates every 15-30 minutes, even if nothing has changed:
**Incident Update — 15:30 AEDT**
**Status:** Investigating
**Impact:** ~5% of checkout attempts failing with "Payment Error"
**Duration:** Started ~14:45 AEDT (45 min ago)
**What we know:**
- Stripe is returning 402 errors for a subset of payment attempts
- Affected users have cards issued by [specific bank]
- No code deploys since 11:00
**What we're doing:**
- Checking Stripe status page and contacting support
- Reviewing recent Stripe API changes
- Testing with affected card types in staging
**Next update:** 16:00 AEDT or sooner if status changes
Who to notify and when
| Severity | Notify | When |
|---|---|---|
| SEV1 (full outage) | Engineering lead, CTO, customer support, status page | Immediately |
| SEV2 (major degradation) | Engineering lead, team lead, customer support | Within 15 minutes |
| SEV3 (minor degradation) | Team channel | Within 30 minutes |
| SEV4 (cosmetic/low impact) | Ticket created | Next business day |
The golden rule of incident communication
Never say “we’re looking into it” without saying what you’re looking into. “We’re investigating whether the recent Stripe API update caused checkout failures for AMEX cards” is 100x better than “we’re investigating.” Specificity builds trust.
Blameless Postmortems
Every SEV1 and SEV2 incident gets a postmortem. The goal is learning, not blame. Here’s the template I use:
## Incident Postmortem: Checkout Payment Failures
**Date:** 2026-02-26
**Duration:** 14:45 - 16:20 AEDT (95 minutes)
**Severity:** SEV2
**Author:** [Your name]
### Summary
~5% of checkout payments failed due to a Stripe API behavior change
affecting 3DS authentication for cards from ANZ Bank.
### Timeline
- 14:45 — First error alerts fire in #alerts-payments
- 14:52 — Engineer begins investigation
- 15:10 — Identified that only ANZ-issued cards are affected
- 15:25 — Found Stripe changelog noting 3DS flow change
- 15:40 — Hotfix deployed to handle new 3DS response format
- 16:00 — Error rate returns to baseline
- 16:20 — Incident resolved, monitoring confirmed
### Root Cause
Stripe changed the 3DS authentication response format for certain
card issuers. Our payment handler expected `three_d_secure.status`
but the new format uses `three_d_secure_result.status`. The code
fell through to the error handler.
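The fix amounts to defensive parsing. Here is a sketch of the idea — only the two field names come from this incident; the helper name and the status values are illustrative, not Stripe's actual enum:

```typescript
// Sketch: accept both the old and new 3DS response shapes, and surface
// unknown shapes explicitly instead of falling through to the generic
// error handler. Status values here are illustrative.
type ThreeDSStatus = 'succeeded' | 'failed' | 'attempted' | 'unknown';

function threeDSStatus(charge: any): ThreeDSStatus {
  const status =
    charge?.three_d_secure_result?.status ?? // new format
    charge?.three_d_secure?.status;          // old format
  if (status === 'succeeded' || status === 'failed' || status === 'attempted') {
    return status;
  }
  return 'unknown'; // log and alert here rather than treating it as a hard failure
}
```

The point is less the exact shape than the posture: an integration boundary should distinguish "provider said no" from "provider said something I don't recognize."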
### What Went Well
- Alerts fired within 5 minutes of first error
- Team identified the affected user segment quickly
- Hotfix was deployed within 50 minutes
### What Could Be Better
- We didn't monitor the Stripe changelog for breaking changes
- Our payment handler didn't have a fallback for unknown response formats
- Integration tests didn't cover 3DS edge cases
### Action Items
- [ ] Add Stripe changelog to weekly review (owner: @payments-team)
- [ ] Add defensive parsing for Stripe responses (owner: @alice)
- [ ] Expand payment integration test suite (owner: @bob)
- [ ] Set up contract tests against Stripe's API (owner: @payments-team)
The most important section of a postmortem is “Action Items.” If you write a postmortem without actionable follow-ups, it’s just storytelling. Every item should have an owner and a deadline.
Building Debug-Friendly Systems
The best debugging happens before the incident — by building systems that are easy to debug.
Structured logging from day one
// Every log entry should have these fields
logger.info('Payment processed', {
  requestId: req.id,
  traceId: req.traceId,
  userId: user.id,
  action: 'payment.process',
  amount: payment.amount,
  currency: payment.currency,
  provider: 'stripe',
  duration: Date.now() - startTime,
  success: true,
});
Request tracing
Generate a unique traceId at the edge and propagate it through every service. When something goes wrong, grep for the traceId and you have the complete request lifecycle across all services.
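A minimal sketch of the edge step — the header name and id format are assumptions; real systems often use a standard like W3C `traceparent` instead:

```typescript
// Sketch: reuse an incoming trace id if the caller supplied one,
// otherwise mint a fresh one at the edge. Header name is an assumption.
import { randomUUID } from 'node:crypto';

const TRACE_HEADER = 'x-trace-id';

function ensureTraceId(headers: Record<string, string | undefined>): string {
  return headers[TRACE_HEADER] ?? randomUUID();
}

// In an Express-style middleware, every downstream call and log line
// then carries the same id:
// app.use((req, res, next) => {
//   req.traceId = ensureTraceId(req.headers as any);
//   res.setHeader(TRACE_HEADER, req.traceId);
//   next();
// });
```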
Health checks that mean something
A health check that returns 200 when the database is down is worse than no health check. Test real dependencies:
app.get('/health', async (req, res) => {
  const checks = await Promise.allSettled([
    db.query('SELECT 1'),
    redis.ping(),
    fetch('https://api.stripe.com/v1', { method: 'HEAD' }),
  ]);
  const status = checks.every(c => c.status === 'fulfilled') ? 200 : 503;
  res.status(status).json({
    status: status === 200 ? 'healthy' : 'degraded',
    checks: {
      database: checks[0].status,
      cache: checks[1].status,
      stripe: checks[2].status,
    },
    timestamp: new Date().toISOString(),
  });
});
Feature flags for everything
Every feature should be behind a flag. Not because you’re doing A/B testing — because you need a kill switch. When a feature causes a production issue, toggling a flag is 100x faster than deploying a revert.
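One way to make the kill switch structural rather than ad hoc is a wrapper that routes between the new path and the known-good one. A sketch — the wrapper and the flag-client signature are my own, not a particular library's API:

```typescript
// Sketch: kill-switch wrapper. If the flag is off, the flag service is
// down, or the new path throws, fall back to the known-good path.
async function withKillSwitch<T>(
  flagName: string,
  enabled: () => Promise<T>,              // new code path
  fallback: () => Promise<T>,             // known-good path
  isEnabled: (flag: string) => Promise<boolean>, // flag client (assumed shape)
): Promise<T> {
  try {
    if (await isEnabled(flagName)) {
      return await enabled();
    }
  } catch {
    // flag service unreachable or new path threw: prefer degraded-but-working
  }
  return fallback();
}
```

The design choice worth noting: the fallback fires not only when the flag is off, but when the flag infrastructure itself fails — the kill switch must not become a new single point of failure.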
Error boundaries with context
In React applications, error boundaries should capture and report context:
class ErrorBoundary extends React.Component<Props, State> {
  // Initialize state so the first render doesn't read from null
  state: State = { hasError: false };

  static getDerivedStateFromError(error: Error) {
    return { hasError: true, error };
  }

  componentDidCatch(error: Error, info: React.ErrorInfo) {
    reportError({
      error,
      componentStack: info.componentStack,
      route: window.location.pathname,
      userId: this.props.userId,
      buildVersion: process.env.BUILD_VERSION,
      timestamp: new Date().toISOString(),
    });
  }

  render() {
    if (this.state.hasError) {
      return <ErrorFallback error={this.state.error} />;
    }
    return this.props.children;
  }
}
The Debugging Mindset
After hundreds of production incidents, I’ve distilled my approach to a few principles:
- Mitigate first, debug second. Restore service as fast as possible. You can always investigate the root cause later.
- Follow the data, not your intuition. Your gut says “it’s the database.” The logs say it’s a DNS issue. Trust the logs.
- Eliminate possibilities systematically. Don’t jump to conclusions. Work through the checklist. Narrow down the layer, the time window, the code path.
- Time-box your investigation. If you’ve spent 30 minutes and haven’t found the root cause, escalate. Fresh eyes find things tired eyes miss.
- Write it down. Every incident is a learning opportunity, but only if you capture the lessons. The postmortem is not optional.
Production debugging is a skill that improves with practice. Every incident you handle makes you faster at the next one. The playbook helps, but experience is irreplaceable.