Playbook
How to debug production incidents
- Identify the symptom: high CPU, slow API responses, lag, or timeouts.
- Check the logs and any recent deployments.
- Check metrics: CPU, memory, network, query time.
- Check dependencies: database, Kafka broker, external API.
- Identify the root cause, for example a missing index, a slow query, a leak, a spike, or a misconfiguration.
- Fix and prevent recurrence: add the index, add retries, extend monitoring, harden configuration.
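
The "retry" item in the last step can be sketched in Python as below. This is a minimal illustration under stated assumptions, not a prescribed implementation: the name call_with_retry, its parameters, and the default values are all hypothetical, and it assumes the failing dependency raises the built-in TimeoutError or ConnectionError. It uses exponential backoff with random jitter, so that many clients retrying at once do not hit a recovering database, broker, or external API in lockstep.

```python
import random
import time


def call_with_retry(operation, max_attempts=4, base_delay=0.2, max_delay=5.0,
                    retry_on=(TimeoutError, ConnectionError), sleep=time.sleep):
    """Call operation(); on a retryable error, wait and try again.

    All names and defaults here are illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error to the caller
            # Backoff ceiling doubles each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random time between 0 and the ceiling.
            sleep(random.uniform(0, ceiling))
```

Only wrap operations that are safe to repeat (idempotent reads, or writes with a deduplication key). Errors outside retry_on, such as a ValueError from bad input, are raised immediately rather than retried.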