Debug Production Incident

Playbook

How to debug production incidents

Identify the symptom: high CPU, slow API responses, lag, or timeouts.
Check the logs and any recent deployments.
Check metrics: CPU, memory, network, query time.
Check dependencies: database, Kafka broker, external API.
Identify the root cause: a missing index, slow query, resource leak, load spike, or misconfiguration.
Fix and prevent recurrence: add the missing index, add retries, extend monitoring, and harden configuration.
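
The "retry" item in the last step can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes the dependency call raises an exception on a transient failure, and the names call_with_retry, attempts, base_delay, and sleep are hypothetical, chosen only for this example.

```python
import time


def call_with_retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func() and retry on any exception, with exponential backoff.

    Hypothetical helper for illustration only.
    attempts   -- total number of tries, including the first one
    base_delay -- wait before the first retry, in seconds; doubles each retry
    sleep      -- injectable wait function, so tests can avoid real sleeping
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            # Out of tries: surface the last error to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

A call such as call_with_retry(fetch_orders), where fetch_orders is a placeholder for the real database, Kafka, or external API call, would try up to three times. In practice it is usually safer to catch only the exception types known to be transient and to cap the total wait, so that retries do not mask a real outage or add load to a dependency that is already struggling.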