Playbook

How to debug production incidents

  1. Identify the symptom: high CPU, slow API responses, lag, or timeouts.
  2. Check the logs and any recent deployments.
  3. Check metrics: CPU, memory, network, query time.
  4. Check dependencies: database, Kafka broker, external API.
  5. Identify the root cause: missing index, slow query, resource leak, load spike, or misconfiguration.
  6. Fix and prevent recurrence: add the index, add retries, extend monitoring, harden the configuration.
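
The "retry" fix in step 6 could be sketched as below. This is a minimal illustration, not a prescribed implementation: the function name retry_with_backoff, its parameters, the default delays, and the choice of exponential backoff with full jitter are all assumptions made for the example. Only Python standard library calls are used.

```python
import random
import time


def retry_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Run call() until it succeeds or max_attempts is exhausted.

    Hypothetical helper for illustration; names and defaults are assumptions.
    Only exceptions listed in retry_on are retried; anything else propagates
    immediately so real bugs are not hidden behind retries.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff (base_delay, 2x, 4x, ...) capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time in [0, ceiling] so that many
            # clients recovering at once do not retry in lockstep.
            sleep(random.uniform(0, ceiling))


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_dependency():
        # Simulated dependency that times out twice, then recovers.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise TimeoutError("simulated timeout")
        return "ok"

    print(retry_with_backoff(flaky_dependency, base_delay=0.01))
```

Retries of this kind are only safe for operations that can be repeated without side effects. The attempt cap matters during an incident, because unbounded retries against a struggling database or broker add load to the dependency that is already failing.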