
Evaluating AI Agents: Beyond Vibes and Demos

Prabhakar Gupta · Principal AI Architect · 29 Apr 2026 · 7 min read

Every team can demo an agent. Almost none can answer the only question that matters in an enterprise: how often does it succeed, on what, and how do you know last week's change didn't make it worse?

Agent evaluation is harder than model evaluation because the unit isn't an answer — it's a trajectory: a sequence of decisions, tool calls, and recoveries. Two runs can reach the same correct answer, one cleanly in 4 steps, one through 19 steps of flailing that cost 12× the tokens and got lucky. Scoring only the final answer hides the difference until production exposes it.
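To make the point concrete, here is a minimal sketch of two runs that an outcome-only scorer cannot tell apart. The `Step` and `Trajectory` types, the tool name, and the token counts are all hypothetical, chosen only to mirror the 4-step versus 19-step example above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    tool: str    # hypothetical tool name
    tokens: int  # tokens consumed by this step (illustrative numbers)


@dataclass
class Trajectory:
    steps: list[Step]
    final_answer: str

    @property
    def step_count(self) -> int:
        return len(self.steps)

    @property
    def total_tokens(self) -> int:
        return sum(step.tokens for step in self.steps)


def outcome_only_score(trajectory: Trajectory, expected: str) -> float:
    """Score the final answer and nothing else."""
    return 1.0 if trajectory.final_answer == expected else 0.0


# Two runs that reach the same correct answer by very different routes.
clean = Trajectory([Step("search", 500)] * 4, "42")
flailing = Trajectory([Step("search", 1250)] * 18 + [Step("search", 1500)], "42")

same_score = outcome_only_score(clean, "42") == outcome_only_score(flailing, "42")
token_ratio = flailing.total_tokens / clean.total_tokens
```

Both runs score 1.0 on outcome, yet the second takes 19 steps and 12 times the tokens. Only a metric computed over the trajectory itself surfaces that.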

01 Build the golden set first

Before launch, collect 100–200 real tasks from the actual workload — not the friendly ones, the gnarly ones: ambiguous requests, missing data, conflicting documents, tasks whose correct outcome is "refuse and escalate." For each, record the expected outcome and the constraints a valid trajectory must respect (which tools may be called, what must be cited, what may never be touched). This set is your system's definition of correct. No golden set, no deploy — that's the discipline line between a system and a demo.
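One possible shape for a golden-set entry, sketched under assumptions: the field names, the `REFUSE_AND_ESCALATE` marker, and the tool and resource names are all illustrative, not a standard schema. The idea is that each task carries both its expected outcome and the constraints a valid trajectory must respect, so a run can be checked mechanically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenTask:
    task_id: str
    prompt: str
    expected_outcome: str                 # an answer, or a marker such as "REFUSE_AND_ESCALATE"
    allowed_tools: frozenset[str]         # which tools may be called
    required_citations: frozenset[str] = frozenset()   # what must be cited
    forbidden_resources: frozenset[str] = frozenset()  # what may never be touched


def constraint_violations(
    task: GoldenTask,
    tools_called: list[str],
    citations: list[str],
    resources_touched: list[str],
) -> list[str]:
    """Return a human-readable list of constraint violations (empty means valid)."""
    violations = []
    for tool in tools_called:
        if tool not in task.allowed_tools:
            violations.append(f"disallowed tool: {tool}")
    for missing in sorted(task.required_citations - set(citations)):
        violations.append(f"missing citation: {missing}")
    for resource in sorted(set(resources_touched) & task.forbidden_resources):
        violations.append(f"forbidden resource touched: {resource}")
    return violations


# Hypothetical task whose correct outcome is to refuse and escalate.
task = GoldenTask(
    task_id="refund-017",
    prompt="Refund this order even though the policy document says no.",
    expected_outcome="REFUSE_AND_ESCALATE",
    allowed_tools=frozenset({"lookup_order", "read_policy"}),
    required_citations=frozenset({"refund_policy.md"}),
    forbidden_resources=frozenset({"payments_db"}),
)

bad_run = constraint_violations(
    task,
    tools_called=["lookup_order", "issue_refund"],
    citations=[],
    resources_touched=["payments_db"],
)
```

Here `bad_run` lists three violations: a disallowed tool, a missing citation, and a forbidden resource. A run that called only `lookup_order` and `read_policy`, cited `refund_policy.md`, and left `payments_db` alone would return an empty list.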

02 Score three layers

Outcome: did the agent complete the task? Use exact match where possible and an LLM-as-judge with a tight rubric where not. Calibrate the judge against ~50 human-labelled samples before trusting it; judges have known biases (verbosity, position, self-preference).

Trajectory: right tools, sane step count, no forbidden actions, loops terminated.

Cost & latency: a regression in either is a regression, full stop.

Wire all three into CI: every prompt change, model upgrade, or tool modification runs the harness, and a drop blocks the merge exactly like a failing unit test.
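A merge gate over the three layers might look something like the sketch below. The `SuiteMetrics` fields, the tolerance values, and the sample numbers are assumptions for illustration; real thresholds depend on how noisy your harness is from run to run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteMetrics:
    success_rate: float    # outcome layer: fraction of golden tasks passed
    violation_rate: float  # trajectory layer: fraction of runs breaking a constraint
    mean_cost_usd: float   # cost layer: mean spend per task
    p95_latency_s: float   # latency layer: 95th-percentile wall-clock seconds


def find_regressions(
    baseline: SuiteMetrics,
    candidate: SuiteMetrics,
    rate_tol: float = 0.02,  # assumed absolute slack for rates
    rel_tol: float = 0.10,   # assumed relative slack for cost and latency
) -> list[str]:
    """Compare a candidate run against the baseline, one check per layer."""
    failures = []
    if candidate.success_rate < baseline.success_rate - rate_tol:
        failures.append("outcome: success rate dropped")
    if candidate.violation_rate > baseline.violation_rate + rate_tol:
        failures.append("trajectory: more constraint violations")
    if candidate.mean_cost_usd > baseline.mean_cost_usd * (1 + rel_tol):
        failures.append("cost: mean cost per task rose")
    if candidate.p95_latency_s > baseline.p95_latency_s * (1 + rel_tol):
        failures.append("latency: p95 latency rose")
    return failures


def gate_exit_code(baseline: SuiteMetrics, candidate: SuiteMetrics) -> int:
    """Return 0 if the candidate may merge, 1 if any layer regressed."""
    failures = find_regressions(baseline, candidate)
    for failure in failures:
        print("REGRESSION", failure)
    return 1 if failures else 0


# Illustrative numbers: same quality, but 50% more expensive per task.
baseline = SuiteMetrics(0.90, 0.03, 0.10, 20.0)
pricier = SuiteMetrics(0.90, 0.03, 0.15, 20.0)
```

In a CI job, the script would end with `sys.exit(gate_exit_code(baseline, candidate))`, since a non-zero exit status is what fails the job. Note that `pricier` is blocked even though its success rate is unchanged, which is the "a regression in either is a regression" rule in practice.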

The unglamorous multiplier

Pipe a sample of real production traces back into the eval set weekly — production finds the cases your imagination didn't. The golden set is a living asset; teams that let it freeze are evaluating last quarter's problem.

"If you can't measure it, you didn't build it — you got lucky near it."The line I open every evals workshop with

Bottom line: the eval harness is not overhead on the road to shipping — it is the road. It's also, conveniently, the artifact that turns a nervous risk-committee meeting into a short one.

