Why Ad-Hoc Testing Fails
An agent “feels right” during manual testing. You try 10 prompts, they all work, you ship. Two weeks later, users find the 11th case where it confidently does the wrong thing.
Agents fail in ways that deterministic code doesn’t. Minor phrasing changes route requests to the wrong topic. Edge-case inputs pass silently through reasoning that looked fine in the trace. A vendor model update shifts behavior overnight.
The only reliable defense is a proper evaluation discipline: a test set you run before every change, with metrics you track over time.
Anatomy of a Test Set
Each test case has:
- Input: the user message or inputs as they’d arrive in production.
- Expected behavior: the topic that should be selected, the actions that should fire, the substance the response should contain.
- Grading criteria: how to determine if the actual behavior met expectations.
Size depends on the agent’s scope. For a single-topic agent, 50 cases is a floor; for a multi-topic agent, plan for 200+ before launch.
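A test case doesn’t need heavy tooling to represent. A minimal sketch in Python, assuming cases are kept as plain records; the field names, topic, and bucket labels are illustrative, not an Agent Builder schema:

```python
from dataclasses import dataclass, field

@dataclass
class TestCase:
    # The user message exactly as it would arrive in production.
    input: str
    # Expected behavior: topic, actions, and substance the response must contain.
    expected_topic: str
    expected_actions: list[str] = field(default_factory=list)
    must_contain: list[str] = field(default_factory=list)
    # Coverage bucket (see the next section): happy_path, edge_case,
    # adversarial, out_of_scope, or regression.
    bucket: str = "happy_path"

# Illustrative case; the topic and action names are hypothetical.
case = TestCase(
    input="My order #4821 never arrived. Can you open a case?",
    expected_topic="Order Support",
    expected_actions=["Create_Case"],
    must_contain=["case number"],
    bucket="happy_path",
)
```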
Coverage Buckets
Structure your test set by scenario type.
1. Happy Path
The canonical user journeys your agent was built for. Most common phrasings, well-formed inputs, clear intents.
Target: 95%+ success rate.
2. Edge Cases
Unusual but realistic inputs. Very long questions, multiple intents in one message, ambiguous references, non-English tokens mixed in.
Target: 80%+ success rate. Failures here are expected but shouldn’t be catastrophic.
3. Adversarial
Attempts to jailbreak, prompt-inject, or manipulate the agent. Users asking it to do things outside scope. Prompts trying to override instructions.
Target: 100% refusal rate for genuine adversarial cases.
4. Out-of-Scope
Questions the agent shouldn’t answer at all. A case-management agent being asked about HR policy.
Target: 100% graceful deflection to a human or clear “I can’t help with that.”
5. Regression
Cases that broke in the past. Every bug found in production becomes a regression test.
Target: 100% success — these are the things you explicitly fixed.
Grading Methods
Different criteria require different graders.
Deterministic Grading
For outcomes that are binary: did the agent call Create_Case? Does the response contain the case number? Did it refuse the out-of-scope request?
Write assertions that inspect the reasoning trace and response directly. Fast, cheap, reliable.
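A sketch of what such an assertion can look like, assuming traces are captured as plain dicts and cases follow the TestCase record sketched earlier; the trace layout is illustrative, not a platform API:

```python
def grade_deterministic(case, trace: dict, response: str):
    """Binary checks against the reasoning trace and final response.

    Assumes trace["steps"] lists the actions the agent invoked; adapt the
    lookup to however your tooling records traces.
    """
    called = {step["action"] for step in trace.get("steps", []) if "action" in step}
    checks = {
        "right_actions": set(case.expected_actions) <= called,
        "response_substance": all(
            s.lower() in response.lower() for s in case.must_contain
        ),
    }
    return all(checks.values()), checks
```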
LLM-as-Judge
For subjective quality — “was the tone professional?”, “did it answer the question clearly?” — use an LLM to grade.
The pattern: the grader LLM receives the user prompt, the agent response, and a rubric, then returns a score (0–5 or pass/fail).
Use with discipline. LLM graders are themselves imperfect. Calibrate against human grading on a subset.
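A minimal sketch of the pattern. The rubric is an example, and `call_llm` stands in for whichever model client you actually use; it is a placeholder, not a real SDK call:

```python
import json

JUDGE_RUBRIC = """You are grading a support agent's reply.
Score each criterion 0-5, then output PASS only if every score is 4 or higher:
1. Answers the user's question directly.
2. Professional, on-brand tone.
3. Makes no claims unsupported by the provided context.
Respond as JSON: {"scores": [...], "verdict": "PASS" or "FAIL"}"""

def grade_with_llm(user_prompt: str, agent_response: str, call_llm) -> dict:
    # call_llm: any function that takes a prompt string and returns the
    # grader model's text output (placeholder, not a specific SDK).
    grading_prompt = (
        f"{JUDGE_RUBRIC}\n\n"
        f"User prompt:\n{user_prompt}\n\n"
        f"Agent response:\n{agent_response}\n"
    )
    return json.loads(call_llm(grading_prompt))
```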
Human Grading
For final gates, occasional audits, and building trust. A subject-matter expert reviews a sample of outputs and assigns scores.
Expensive. Use for weekly or monthly quality audits, not for every change.
Metrics to Track
Beyond pass/fail:
- Topic accuracy: percentage of inputs routed to the correct topic. The most fundamental metric; if routing is wrong, everything downstream is wrong.
- Action selection accuracy: given the right topic, did the agent choose the right action(s)?
- Argument accuracy: did it pass correct arguments to the actions? A right action with wrong arguments is still broken.
- Response quality: subjective, LLM-graded or human-graded. Measures readability, accuracy, helpfulness.
- Latency: p50, p95, p99 response times. User experience metric.
- Refusal rate on adversarial: percentage of adversarial cases refused correctly.
- Regression rate: percentage of regression test cases still passing.
Track these per release. A small regression in any metric may be acceptable if something else improved; silent drift is not.
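A sketch of how these roll up from per-case results, assuming each eval run yields one record per test case; the record shape and bucket names are illustrative:

```python
def summarize(results: list[dict]) -> dict:
    """Each record looks roughly like:
    {"bucket": "happy_path", "topic_ok": True, "actions_ok": True,
     "args_ok": True, "passed": True, "latency_ms": 1840}
    """
    def rate(key, rows):
        rows = [r for r in rows if key in r]
        return sum(r[key] for r in rows) / len(rows) if rows else None

    adversarial = [r for r in results if r.get("bucket") == "adversarial"]
    regression = [r for r in results if r.get("bucket") == "regression"]
    latencies = sorted(r["latency_ms"] for r in results if "latency_ms" in r)
    return {
        "topic_accuracy": rate("topic_ok", results),
        "action_accuracy": rate("actions_ok", results),
        "argument_accuracy": rate("args_ok", results),
        "adversarial_refusal": rate("passed", adversarial),
        "regression_pass": rate("passed", regression),
        # Nearest-rank p95; swap in a proper percentile function if you prefer.
        "latency_p95_ms": latencies[int(0.95 * (len(latencies) - 1))] if latencies else None,
    }
```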
Running Evaluations
Manual approach: sit in Agent Builder, paste prompts, grade by eye. Adequate for small test sets, tedious for real ones.
Programmatic approach: use Agent Evaluation APIs (or Agent Builder’s test surface when it supports batched runs) to execute the test set and collect traces.
Persist results. A CSV of “prompt, expected topic, actual topic, pass/fail, trace” per run is a minimum record.
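A minimal sketch of that record, one CSV per run; the directory, file naming, and columns are illustrative:

```python
import csv
import datetime
import pathlib

def persist_run(results: list[dict], out_dir: str = "eval_runs") -> pathlib.Path:
    """Write one CSV per eval run with the minimum record per test case."""
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    path = pathlib.Path(out_dir) / f"run-{stamp}.csv"
    fields = ["prompt", "expected_topic", "actual_topic", "passed", "trace"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for r in results:
            writer.writerow({k: r.get(k, "") for k in fields})
    return path
```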
Pre-Change Gate
Before any agent change deploys:
- Run the full test set against the changed configuration in sandbox.
- Compare metrics against the baseline (last accepted run).
- Require no regressions below defined thresholds.
- Spot-check a sample of failures manually.
- Only then deploy.
“Changes passed CI” is not the same as “changes passed agent eval.” Build the latter discipline.
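A sketch of that gate as code, using the metric summaries from the previous section; the threshold values are illustrative, not recommendations:

```python
# Maximum tolerated drop versus the baseline run, per metric (illustrative).
THRESHOLDS = {
    "topic_accuracy": 0.0,
    "action_accuracy": 0.02,
    "adversarial_refusal": 0.0,
    "regression_pass": 0.0,
}

def gate(baseline: dict, candidate: dict) -> list[str]:
    """Return blocking regressions; an empty list means the gate passes."""
    failures = []
    for metric, max_drop in THRESHOLDS.items():
        before, after = baseline.get(metric), candidate.get(metric)
        if before is None or after is None:
            continue
        if before - after > max_drop:
            failures.append(f"{metric}: {before:.3f} -> {after:.3f}")
    return failures
```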
Production Monitoring
Test sets are for pre-deploy. Production monitoring catches drift.
- Sample conversations. Log a percentage (1–5%) for review.
- User feedback signals. Thumbs up/down, escalation-to-human, conversation abandonment.
- Tool failure rates. Actions erroring at higher rates than usual.
- Content safety flags. Trust Layer blocks, masking activations, toxicity detections.
Alert on anomalies. A spike in escalations may mean model drift you didn’t anticipate.
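A deliberately simple sketch of one such alert, comparing the current escalation rate to a trailing average; the multiplier is an arbitrary example, not a tuned threshold:

```python
def escalation_spike(current_rate: float, trailing_rates: list[float],
                     factor: float = 2.0) -> bool:
    """Flag when the current escalation-to-human rate exceeds the trailing
    average by `factor`. Replace with whatever anomaly detection you trust."""
    if not trailing_rates:
        return False
    baseline = sum(trailing_rates) / len(trailing_rates)
    return current_rate > factor * baseline
```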
Red-Teaming
For any customer-facing agent, red-team it before shipping. Specific goals:
- Attempt to extract sensitive internal data.
- Attempt to cause embarrassment (off-brand responses, factual claims the company doesn’t stand behind).
- Attempt to get the agent to take unauthorized actions.
- Try prompt injection through user-supplied data.
Have a security-minded colleague run a planned red-team session. Budget a day. The findings are disproportionately valuable compared with what routine eval surfaces.
Change Log Discipline
Every agent change gets:
- What changed (topic instruction edit, new action, model routing).
- Why (customer request, production bug, experiment).
- Eval results before and after.
- Deployment timestamp and author.
A spreadsheet or lightweight tool is fine. The discipline is what matters. When something breaks in production, the change log points at what shifted recently.
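A minimal sketch of that log kept as a CSV, in the spirit of “a spreadsheet is fine”; the file name and columns are illustrative:

```python
import csv
import datetime
import pathlib

CHANGELOG = pathlib.Path("agent_changelog.csv")  # illustrative path
FIELDS = ["timestamp", "author", "what_changed", "why", "eval_before", "eval_after"]

def log_change(author: str, what_changed: str, why: str,
               eval_before: str, eval_after: str) -> None:
    """Append one row per agent change; creates the file with a header if needed."""
    is_new = not CHANGELOG.exists()
    with open(CHANGELOG, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.datetime.now().isoformat(timespec="seconds"),
            "author": author,
            "what_changed": what_changed,
            "why": why,
            "eval_before": eval_before,
            "eval_after": eval_after,
        })
```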
Frequently Asked Questions
How much time should I budget for eval?
For an initial agent launch, 20–30% of the total build time should be eval work. After launch, each change should carry eval overhead proportional to its blast radius.
Does Salesforce provide evaluation tools?
Agent Builder includes a test surface. Programmatic evaluation APIs exist but are maturing. Many teams supplement with custom tooling.
Can I eval LLM-judged quality consistently?
With calibration, yes. Align the grader on a small human-graded sample; iterate the grader’s rubric until its scores correlate with human scores.
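One way to quantify that alignment is a simple correlation between grader and human scores on the calibration sample. A sketch, assuming Python 3.10+ for statistics.correlation; the 0.8 bar is an illustrative target, not a standard:

```python
import statistics

def grader_aligned(human_scores: list[float], llm_scores: list[float],
                   minimum: float = 0.8) -> bool:
    """Pearson correlation between human and LLM-judge scores on the same
    sample; keep iterating on the rubric until this clears your bar."""
    return statistics.correlation(human_scores, llm_scores) >= minimum
```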
How do I know my test set is complete?
You don’t, conclusively. Keep adding cases as you find new failure modes in production. Coverage grows asymptotically over time.