Judgment Labs publishes Agent Judge to fix long-context eval failures

Judgment Labs published a methodology called Agent Judge on May 29, describing an agentic evaluation harness built to address the specific ways LLM-as-judge breaks down when trajectories grow long and stateful.

The diagnosis is structural. A single-shot LLM judge works acceptably when the evaluation fits cleanly in context: hand the judge the query, the final output, and a rubric, then ask it to score. That model holds for short, text-in-text-out interactions. It fails for production agents that spend dozens of tool calls updating a CRM, editing source files, opening pull requests, and querying databases before returning a result. The trajectory runs longer than the judge’s reliable reasoning window. Evidence gets truncated. Stateful side effects go unverified. The judge scores the claimed action, not the actual one.

Agent Judge addresses this through three capabilities.

Search converts a long trajectory into a queryable object. Rather than pasting an entire trace into one prompt, the harness dispatches reader agents to inspect targeted slices. One reader might track which database record the agent accessed; a second checks whether a later decision depended on stale evidence. Agent Judge can also search across prior trajectories to detect recurring failure patterns, not just isolated errors in the current run.
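The idea of querying a trace in slices can be illustrated with a small sketch. This is not Judgment Labs' implementation: the `Step` fields, the tool labels, the `TrajectoryIndex` class, and the `find_stale_decisions` helper are all hypothetical names invented for illustration, and the "reader" here is a plain function rather than an LLM agent.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Step:
    index: int                      # position in the trajectory
    tool: str                       # hypothetical labels: "db.read", "db.write", "decide"
    target: str                     # record the step touched
    based_on: Optional[int] = None  # for "decide": index of the read it relied on


class TrajectoryIndex:
    """Holds a trace so readers can pull targeted slices instead of the full text."""

    def __init__(self, steps: List[Step]) -> None:
        self.steps = list(steps)

    def query(self, tool: Optional[str] = None,
              target: Optional[str] = None) -> List[Step]:
        # A None filter means "match anything" for that field.
        return [s for s in self.steps
                if (tool is None or s.tool == tool)
                and (target is None or s.target == target)]


def find_stale_decisions(index: TrajectoryIndex, target: str) -> List[int]:
    """Return indices of decisions whose supporting read was overwritten
    by a later write before the decision was made."""
    write_positions = [s.index for s in index.query(tool="db.write", target=target)]
    stale = []
    for decision in index.query(tool="decide", target=target):
        if decision.based_on is None:
            continue
        if any(decision.based_on < w < decision.index for w in write_positions):
            stale.append(decision.index)
    return stale


trace = TrajectoryIndex([
    Step(0, "db.read", "rec-1"),
    Step(1, "db.write", "rec-1"),
    Step(2, "decide", "rec-1", based_on=0),  # the read at 0 was overwritten at 1
    Step(3, "db.read", "rec-1"),
    Step(4, "decide", "rec-1", based_on=3),  # fresh read, not stale
])
stale_steps = find_stale_decisions(trace, "rec-1")
```

In this toy trace only the decision at step 2 is flagged. The point of the sketch is the shape of the check: each reader asks a narrow question of a slice, so no single prompt has to hold the whole trajectory.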

Verification connects the evaluation to live or recorded system state. The judge checks whether the agent’s claimed actions actually happened. Did the pull request open? Did the API call succeed? Did the CI run pass? Agent Judge inspects captured tool evidence, including API responses, logs, and database confirmations, then aligns each claim against source-of-truth systems such as GitHub, ticketing platforms, or MCP servers. The benchmark table in the Judgment Labs post illustrates a concrete case: the agent reported a bug was fixed; CI still showed a failing test. A single-shot LLM judge reading only the trajectory would have scored that as a pass.
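A minimal sketch of claim-versus-state checking, under stated assumptions: the `Claim` type, the claim kinds, the probe functions, and the recorded-state dictionaries below are hypothetical stand-ins. A real harness would query GitHub, a ticketing platform, or an MCP server; none of those calls are shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Claim:
    kind: str  # hypothetical claim types, e.g. "pr_opened", "ci_passed"
    ref: str   # identifier the claim refers to


# A probe answers "is this claim true according to the source of truth?"
Probe = Callable[[str], bool]


def verify_claims(claims: List[Claim], probes: Dict[str, Probe]) -> Dict[Claim, str]:
    """Label each claim confirmed, contradicted, or unverified (no probe available)."""
    verdicts: Dict[Claim, str] = {}
    for claim in claims:
        probe = probes.get(claim.kind)
        if probe is None:
            verdicts[claim] = "unverified"
        elif probe(claim.ref):
            verdicts[claim] = "confirmed"
        else:
            verdicts[claim] = "contradicted"
    return verdicts


# Recorded state standing in for live systems (assumed data, for illustration only).
recorded_ci = {"build-1": "failed"}
recorded_prs = {"pr-a": "open"}

probes: Dict[str, Probe] = {
    "ci_passed": lambda ref: recorded_ci.get(ref) == "passed",
    "pr_opened": lambda ref: recorded_prs.get(ref) == "open",
}

claims = [
    Claim("pr_opened", "pr-a"),
    Claim("ci_passed", "build-1"),     # agent says fixed, CI says otherwise
    Claim("ticket_closed", "t-9"),     # no probe registered for this kind
]
verdicts = verify_claims(claims, probes)
```

The three-way verdict matters: a claim with no available probe is reported as unverified rather than silently passed, which is the gap a trajectory-only judge cannot see.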

Adaptation addresses rubric drift through a component called Rubric Builder. As models, tools, and user workflows change, a fixed rubric over-penalizes improved behavior and stops catching new failure modes. Rubric Builder compares evaluated trajectories against human feedback, pairwise preferences, and production signals, then proposes focused updates: adding a missing criterion, tightening vague language, or removing a rule producing false positives. The rubric becomes a versioned artifact updated alongside the agent rather than rewritten by hand each time behavior shifts.
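One of the update types described above, removing a rule that produces false positives, can be sketched as follows. This is an illustrative assumption about how such a loop might look, not Rubric Builder's actual logic: the `Rubric` type, the criterion names, the `false_flag_share` measure, and the 0.5 threshold are all invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Rubric:
    version: int
    criteria: Tuple[str, ...]


def false_flag_share(flags: List[bool], human_ok: List[bool]) -> float:
    """Among trajectories a criterion flagged, the share humans judged acceptable."""
    flagged = [ok for flag, ok in zip(flags, human_ok) if flag]
    return sum(flagged) / len(flagged) if flagged else 0.0


def propose_removals(rubric: Rubric, flags_by_criterion: Dict[str, List[bool]],
                     human_ok: List[bool], threshold: float = 0.5) -> List[str]:
    """Propose dropping criteria that mostly flag trajectories humans accepted."""
    return [c for c in rubric.criteria
            if false_flag_share(flags_by_criterion.get(c, []), human_ok) > threshold]


def apply_removals(rubric: Rubric, removals: List[str]) -> Rubric:
    """Return a new rubric version; the old one is left untouched."""
    kept = tuple(c for c in rubric.criteria if c not in removals)
    return Rubric(version=rubric.version + 1, criteria=kept)


rubric_v1 = Rubric(version=1, criteria=("tests_pass", "no_extra_files"))
human_ok = [True, True, False, True]                  # human verdict per trajectory
flags = {
    "tests_pass": [False, False, True, False],        # flags only the real failure
    "no_extra_files": [True, True, False, True],      # flags runs humans accepted
}
removals = propose_removals(rubric_v1, flags, human_ok)
rubric_v2 = apply_removals(rubric_v1, removals)
```

Keeping the rubric immutable and bumping `version` on each change mirrors the "versioned artifact" framing: every score can be traced back to the rubric version that produced it.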

The broader framing resonates with two eval patterns this publication has tracked recently. OpenAI’s macro-eval workflow, covered on May 22, uses LLM judges as a first-pass triage layer inside a larger funnel. Spotify’s evals-as-funnel framework treats automated scoring as a cost-reduction mechanism that routes difficult cases to humans. Agent Judge fits into that same architectural slot but adds system-level verification and autonomous search, making it more appropriate for agents with real-world side effects than for constrained text-generation pipelines.

Judgment Labs reports that Agent Judge beat every baseline it tested on accuracy. After five Rubric Builder refinement passes, Agent Judge with a refined rubric reached 0.86 accuracy, 0.88 recall, and 0.79 F1 against human expert labels on internal production traffic from coding agents. Claude Code and Codex running as coding-agent harnesses scored 0.73 and 0.69 accuracy respectively. A standard GPT-5.4 LLM judge reached 0.74 accuracy. One caveat belongs here: Judgment Labs built and benchmarked Agent Judge against its own methodology, using internal traffic labeled by its own human experts. The evaluation framework, the rubric, the difficulty metric, and the test set all originate from the same organization that has a commercial product built on this approach. Independent replication against third-party corpora has not been published.
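The figures above do not include precision, but it can be backed out from recall and F1 if one assumes F1 is the standard harmonic mean of precision and recall, computed on the same class and test set. That assumption is ours, not something the source states. A short worked calculation:

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, giving P = F1 * R / (2R - F1)."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("no valid precision exists for these F1 and recall values")
    return f1 * recall / denominator


# Reported Agent Judge figures after five refinement passes.
precision = implied_precision(f1=0.79, recall=0.88)

# Sanity check: recompute F1 from the implied precision and reported recall.
f1_check = 2 * precision * 0.88 / (precision + 0.88)
```

Under that assumption the implied precision is roughly 0.72, meaning the reported configuration leans toward catching failures at the cost of some false alarms. Because the inputs are rounded to two decimals, the result is approximate.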

Teams running evals on agents that touch external systems, specifically those with database writes, API calls, file edits, or any action where the claimed result and the actual result can diverge, should treat the search and verification pillars as the meaningful additions here. The adaptation loop is genuinely useful but requires production traffic volume to function well. Teams shipping a new agent with limited run history will get less value from Rubric Builder in the early weeks than a team maintaining an agent that has processed thousands of trajectories.

Published by Judgment Labs on 2026-05-29.
