Why Chaos Testing
Production AI systems fail in ways that ordinary testing does not reveal: model endpoints slow down, retrieval returns nothing, costs spike unexpectedly. Chaos engineering verifies that the system degrades gracefully before a real incident puts it to the test.
Failure Scenarios
If the primary LLM is unavailable, does the system fall back to a secondary model? If the vector DB is slow, does the agent degrade to keyword search? When a cost rate limit is hit, does the system throttle gracefully or fail in a cascade? Test each of these explicitly.
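The first scenario can be made concrete with a small Python sketch. Everything named here is hypothetical and stands in for whatever client code a real system uses: `ProviderUnavailable`, `complete_with_fallback`, and the two stub providers are illustrative assumptions, not any vendor's API. The point is only to show the shape of a test that takes the primary away and checks that the secondary answers.

```python
from __future__ import annotations

from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Hypothetical error a provider client raises when it cannot serve a request."""


def complete_with_fallback(
    prompt: str,
    providers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each provider in order; return (provider_name, response) from the first that works."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except ProviderUnavailable as exc:
            # Record the failure and move on to the next provider in the chain.
            errors.append(f"{name}: {exc}")
    # Every provider failed: surface one error that lists all of them.
    raise ProviderUnavailable("; ".join(errors))


def failing_primary(prompt: str) -> str:
    """Stub simulating an outage of the primary model endpoint."""
    raise ProviderUnavailable("injected outage")


def healthy_secondary(prompt: str) -> str:
    """Stub simulating a working secondary model endpoint."""
    return f"secondary answer to: {prompt}"


if __name__ == "__main__":
    served_by, text = complete_with_fallback(
        "hello",
        [("primary", failing_primary), ("secondary", healthy_secondary)],
    )
    print(served_by, "->", text)
```

Returning the provider name alongside the response lets the experiment assert which path actually served the request, rather than only that some answer came back.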
Implementation
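Schedule experiments during business hours, when the team is on hand to respond, and scope them so they do not affect production users. Automate the failure injection, make sure observability captures how the system behaves, and hold a retro after each experiment: did the system respond as designed?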
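Automated failure injection can be as simple as wrapping a dependency call. The sketch below is a minimal illustration under stated assumptions: `ChaosInjector`, `InjectedFailure`, and the in-memory `events` list are hypothetical names, and the list stands in for a real observability pipeline. It injects latency and failures at a configurable rate and records what it did, so the retro can compare observed behavior with the design.

```python
import random
import time
from dataclasses import dataclass, field
from typing import Any, Callable


class InjectedFailure(Exception):
    """Raised in place of the real call when the injector decides to fail it."""


@dataclass
class ChaosInjector:
    failure_rate: float = 0.0     # probability in [0, 1] that a wrapped call fails
    added_latency_s: float = 0.0  # artificial delay added before every wrapped call
    # Seeded generator so an experiment can be replayed (the seed value is arbitrary).
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    # Stand-in for an observability sink: (dependency name, outcome) tuples.
    events: list = field(default_factory=list)

    def wrap(self, name: str, fn: Callable[..., Any]) -> Callable[..., Any]:
        """Return a version of fn that is subject to injected latency and failures."""
        def wrapped(*args: Any, **kwargs: Any) -> Any:
            if self.added_latency_s > 0:
                time.sleep(self.added_latency_s)
            if self.rng.random() < self.failure_rate:
                self.events.append((name, "injected_failure"))
                raise InjectedFailure(f"chaos: {name} failed by design")
            self.events.append((name, "passed_through"))
            return fn(*args, **kwargs)
        return wrapped


if __name__ == "__main__":
    injector = ChaosInjector(failure_rate=1.0)
    vector_search = injector.wrap("vector_search", lambda query: ["doc-1"])
    try:
        vector_search("refund policy")
    except InjectedFailure:
        # A real agent would degrade here, for example to keyword search.
        print("vector search failed as injected")
    print(injector.events)
```

Keeping the injector outside the dependency's own code means the experiment can be switched off by removing the wrapper, without touching the service under test.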
Cultural Shift
Chaos engineering can feel scary. Normalize it through small experiments, starting with a single service in a test environment. Build confidence, then expand scope. Mature practices include game days, where teams coordinate broader scenarios.