Alerting
Alert on: agent error rate spike, Trust Layer block rate rise, cost anomaly, latency SLA breach, user handoff rate jump. Each signal different problem classes.
On-Call Roles
Primary: responds to alerts within SLA. Secondary: escalation when primary stuck. AI Ops lead: escalation for systematic issues needing architecture change. Clear roles; don’t wing it.
Playbooks
Per alert type. Cost spike → identify runaway agent, throttle, investigate. Block rate spike → audit recent prompt changes, revert if needed. Latency breach → scale model endpoint, fall back to smaller model if possible.
Post-Incident
Every incident gets a retro. Root cause beyond ‘the AI was wrong’ — what in our system allowed wrongness to cause user impact. Blameless, concrete action items with owners.