Detect

Monitoring catches most incidents before customers report. When customer reports beat monitoring, that’s a monitoring gap — fix it in the retro. Detection latency is a measured outcome.

Contain

Kill switch for agents must exist and be one-click. Fallback to human or simpler system. Don’t debug while customers hit broken agent. Contain first; investigate second.

Communicate

Status page updates within minutes. Affected users notified appropriately. Leadership and support teams briefed. Silence during incidents erodes trust more than the incident itself.

Learn

Blameless retro within 5 business days. Timeline, root cause, contributing factors, action items with owners. Track action item completion — unfollowed actions mean the learning didn’t stick.

Share