The Problem

Prompt changes, model upgrades, and retrieval tweaks all affect agent behavior. Without regression tests, drift is discovered in production. Aim for 50–200 test cases covering the happy path, edge cases, adversarial inputs, and out-of-scope requests.

Test Case Structure

Each test case pairs an input (the user message plus context) with expectations: the topic selected, the actions called, and the response contents. Decide pass/fail by semantic match, or use an LLM-as-judge for flexibility. Strict string matching is too fragile for LLM outputs.
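As a rough illustration of this structure, here is a minimal Python sketch. The names AgentTestCase, AgentResult, keyword_judge, and evaluate are hypothetical, not part of any particular agent framework. The keyword_judge function is only a stand-in for a real semantic matcher or LLM-as-judge call, which would be passed in through the judge parameter.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentResult:
    """What the agent actually did for one input (hypothetical shape)."""
    topic: str
    actions: list[str]
    response: str


@dataclass
class AgentTestCase:
    """One regression case: input plus expectations (hypothetical shape)."""
    name: str
    message: str
    context: dict = field(default_factory=dict)
    expected_topic: str = ""
    expected_actions: list[str] = field(default_factory=list)
    expected_facts: list[str] = field(default_factory=list)


def keyword_judge(response: str, fact: str) -> bool:
    """Placeholder judge: passes if every word of `fact` occurs in `response`.

    This is an assumption for the sketch; swap in an embedding comparison
    or an LLM-as-judge call for real semantic matching.
    """
    response_words = set(re.findall(r"\w+", response.lower()))
    return all(word in response_words for word in re.findall(r"\w+", fact.lower()))


def evaluate(
    case: AgentTestCase,
    result: AgentResult,
    judge: Callable[[str, str], bool] = keyword_judge,
) -> list[str]:
    """Return a list of failure descriptions; an empty list means the case passed."""
    failures = []
    if result.topic != case.expected_topic:
        failures.append(f"topic: expected {case.expected_topic!r}, got {result.topic!r}")
    if result.actions != case.expected_actions:
        failures.append(f"actions: expected {case.expected_actions}, got {result.actions}")
    for fact in case.expected_facts:
        if not judge(result.response, fact):
            failures.append(f"response missing: {fact!r}")
    return failures
```

Topic and actions are compared exactly because they are discrete choices, while the response text goes through the pluggable judge, which is where fuzzy matching belongs.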

CI Integration

Run the suite on every change. Block deploys when regressions exceed a threshold. Auto-run the suite on model version upgrades, since these are the changes most likely to cause invisible drift.
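One way to express the deploy gate is sketched below in Python. The function name should_block_deploy and the max_regressions parameter are assumptions for illustration. Here a regression means a case that passed on the baseline run and does not pass now, and a case missing from the current run is counted as failing.

```python
def should_block_deploy(
    baseline: dict[str, bool],
    current: dict[str, bool],
    max_regressions: int = 0,
) -> tuple[bool, list[str]]:
    """Compare two runs keyed by test case name (True = passed).

    Returns (block, regressions). `block` is True when the number of
    regressed cases exceeds `max_regressions`.
    """
    regressions = sorted(
        name
        for name, passed in baseline.items()
        # A case missing from the current run is treated as a failure.
        if passed and not current.get(name, False)
    )
    return len(regressions) > max_regressions, regressions
```

A CI step could call this after the suite runs and exit with a non-zero status when the first return value is True, printing the regressed case names so the failure is actionable.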

Maintenance

Test sets rot. Review the suite quarterly. Add a case for each production bug, and remove tests that no longer represent real usage. An unmaintained regression suite becomes theater rather than protection.
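The quarterly review can be nudged along by flagging cases nobody has looked at recently. The Python sketch below assumes each case records a last-reviewed date. That field, the stale_cases name, and the 90-day default are illustrative choices, not something the workflow above prescribes.

```python
from datetime import date, timedelta


def stale_cases(
    last_reviewed: dict[str, date],
    today: date,
    max_age_days: int = 90,
) -> list[str]:
    """Return names of cases whose last review is older than `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(name for name, reviewed in last_reviewed.items() if reviewed < cutoff)
```

The output is only a review queue. Whether a flagged case is kept, updated, or removed still depends on a human judging whether it reflects real usage.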
