The standard way to evaluate a language model (put it on a benchmark, record a score, and ship it) breaks down when the model is an agent operating over dozens of sequential steps in a live environment. Cameron R. Wolfe, writing on his Substack, published a detailed guide to agent evaluation that makes the structural problem clear: the eval tooling available to most teams was designed for a fundamentally different kind of system.
The gap is not academic. Agents deployed in production coding environments or clinical decision support tools are making sequences of decisions where a single wrong step can compound into a catastrophic outcome. A benchmark score from a static multiple-choice task does not tell you how a system behaves when it must plan across a ten-step tool-calling chain, encounter unexpected inputs midway through, and recover without human intervention.
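To make the compounding concrete, here is a minimal arithmetic sketch. It assumes every step succeeds independently with the same probability, which is a simplification for illustration and not a claim from Wolfe's guide; the 95 percent per-step figure and the function name `chain_success_probability` are likewise hypothetical.

```python
def chain_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step in a chain succeeds.

    Assumes steps are independent and equally reliable (an illustrative
    simplification; real agent steps are usually correlated).
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Hypothetical per-step reliability of 95 percent across chains of growing length.
for steps in (1, 3, 10, 30):
    print(f"{steps:>2} steps: {chain_success_probability(0.95, steps):.3f}")
#  1 steps: 0.950
#  3 steps: 0.857
# 10 steps: 0.599
# 30 steps: 0.215
```

Under these assumptions, a step that looks reliable in isolation still leaves a ten-step chain failing roughly four times in ten, which is the kind of behavior a single-prompt score cannot reveal.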
The field is responding slowly. Most evaluation frameworks still measure final-answer accuracy or task completion rate on isolated prompts. When the same logic is applied to an agent, the harness fails to capture intermediate failures, recovery behavior, or cascading errors. A system that scores 87 percent on a static coding benchmark may fail on 40 percent of multi-step refactor tasks where context accumulates across tool calls. Those numbers come from informal internal comparisons across teams, not from any published study, but the pattern is consistent enough that practitioners cite it openly.
What rigorous agent evaluation actually requires is distinct from what most teams are building. Wolfe’s framework points toward realistic harnesses: environments that simulate the noise and variability of production conditions, not cleaned-up academic datasets. These harnesses need to observe intermediate steps, not just terminal outputs. They need to measure reliability across long time horizons rather than single-turn performance. And they need to be outcome-oriented: did the agent accomplish the goal correctly, not just generate a plausible-looking output?
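The requirements above can be sketched as a small harness that records every intermediate step and then checks the outcome against the environment's actual state. This is a toy illustration under stated assumptions, not Wolfe's implementation: the `Environment` interface, `CounterEnv`, `simple_agent`, and the `"stop"` action are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol, Tuple


class Environment(Protocol):
    """Hypothetical interface; real environments will differ."""

    def reset(self) -> str: ...
    def step(self, action: str) -> Tuple[str, bool]: ...  # (observation, step_failed)
    def goal_met(self) -> bool: ...


@dataclass
class StepRecord:
    index: int
    action: str
    observation: str
    failed: bool


@dataclass
class Episode:
    trace: List[StepRecord]  # every intermediate step, not just the final output
    goal_met: bool           # outcome check against the environment's state


def run_episode(agent: Callable[[str], str], env: Environment, max_steps: int = 30) -> Episode:
    """Run one episode, recording each step so intermediate failures stay visible."""
    observation = env.reset()
    trace: List[StepRecord] = []
    for index in range(max_steps):
        action = agent(observation)
        if action == "stop":  # hypothetical convention for "agent is done"
            break
        observation, failed = env.step(action)
        trace.append(StepRecord(index, action, observation, failed))
    return Episode(trace=trace, goal_met=env.goal_met())


class CounterEnv:
    """Toy environment: the goal is to bring a counter to a target value."""

    def __init__(self, target: int) -> None:
        self.target = target
        self.value = 0

    def reset(self) -> str:
        self.value = 0
        return str(self.value)

    def step(self, action: str) -> Tuple[str, bool]:
        if action == "inc":
            self.value += 1
            return str(self.value), False
        return f"unknown action: {action}", True

    def goal_met(self) -> bool:
        return self.value == self.target


def simple_agent(observation: str) -> str:
    """Toy policy: increment until the counter reads 3, then stop."""
    return "stop" if observation == "3" else "inc"


episode = run_episode(simple_agent, CounterEnv(target=3))
print(len(episode.trace), episode.goal_met)
```

The design choice worth noting is that `goal_met` is computed from the environment rather than from anything the agent says about itself, so a plausible-looking final message cannot pass as success.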
The comparison to software testing is useful here. Unit tests check isolated components; integration tests check how components behave when they are wired together. The industry spent fifteen years building the infrastructure to run reliable integration tests at scale. Agent evaluation is at the unit-test stage, except the stakes are closer to a production system than a hello-world function. The lack of a shared evaluation standard is not a philosophical problem; it is a product risk that is already materializing in agent deployments where teams do not discover failures until a customer reports one.
Two factors are making this more urgent now. First, the deployment of agents in high-stakes domains, specifically autonomous coding assistants and medical reasoning tools, is moving faster than the evaluation tooling. Second, agent capability is improving at a rate that makes last quarter’s eval framework irrelevant. A harness designed for a three-step agent running in a sandboxed environment does not generalize to a thirty-step agent with file system access and external API calls.
For builders shipping agent products today, the practical consequence is straightforward. If your evaluation pipeline does not include a realistic multi-step harness that observes intermediate state and measures recovery from mid-task failures, you are flying without instruments. Adding that harness before your next major release is not a research project; it is a quality gate that your customers are already expecting you to have.
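One way such a gate could look, as a rough sketch: inject a single fault at each position in a multi-step task and measure how often the agent still reaches the goal. Everything here is hypothetical and invented for illustration, including the toy tool, the two agents, and the 0.9 threshold; a real gate would wrap a team's own tools and tasks.

```python
from typing import Callable, Optional


class ToolError(Exception):
    """Raised by the hypothetical tool when a fault is injected."""


def make_flaky_tool(fail_on_call: int) -> Callable[[int], int]:
    """Return a toy tool that adds 1, but raises on exactly one chosen call."""
    calls = {"count": 0}

    def tool(x: int) -> int:
        calls["count"] += 1
        if calls["count"] == fail_on_call:
            raise ToolError("injected fault")
        return x + 1

    return tool


def naive_agent(tool: Callable[[int], int], steps: int) -> Optional[int]:
    """Gives up on the first tool error."""
    value = 0
    try:
        for _ in range(steps):
            value = tool(value)
    except ToolError:
        return None
    return value


def retrying_agent(tool: Callable[[int], int], steps: int, max_retries: int = 1) -> Optional[int]:
    """Retries a failed step up to max_retries times before giving up."""
    value = 0
    for _ in range(steps):
        for _attempt in range(max_retries + 1):
            try:
                value = tool(value)
                break
            except ToolError:
                continue
        else:
            return None  # every attempt at this step failed
    return value


def recovery_rate(agent: Callable[[Callable[[int], int], int], Optional[int]], steps: int) -> float:
    """Fraction of fault positions from which the agent still reaches the goal."""
    recovered = 0
    for fail_on_call in range(1, steps + 1):
        if agent(make_flaky_tool(fail_on_call), steps) == steps:
            recovered += 1
    return recovered / steps


GATE_THRESHOLD = 0.9  # hypothetical release threshold

for name, agent in (("naive", naive_agent), ("retrying", retrying_agent)):
    rate = recovery_rate(agent, steps=10)
    print(f"{name}: recovery rate {rate:.2f}, passes gate: {rate >= GATE_THRESHOLD}")
```

Both toy agents complete the task when nothing goes wrong, so a completion-rate metric on clean runs would not separate them; only the injected faults do.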
This article draws on analysis published by Cameron R. Wolfe, Ph.D., on his Substack; the piece is undated.