Alerting

Alert on: agent error rate spike, Trust Layer block rate rise, cost anomaly, latency SLA breach, user handoff rate jump. Each signal different problem classes.

On-Call Roles

Primary: responds to alerts within SLA. Secondary: escalation when primary stuck. AI Ops lead: escalation for systematic issues needing architecture change. Clear roles; don’t wing it.

Playbooks

Per alert type. Cost spike → identify runaway agent, throttle, investigate. Block rate spike → audit recent prompt changes, revert if needed. Latency breach → scale model endpoint, fall back to smaller model if possible.

Post-Incident

Every incident gets a retro. Root cause beyond ‘the AI was wrong’ — what in our system allowed wrongness to cause user impact. Blameless, concrete action items with owners.

Share