Harvey, the legal AI startup, published initial results on its Legal Agent Benchmark holdout on May 26 via an X thread from Gabe Pereyra, the company’s AI lead. The results show frontier coding-and-reasoning models performing far below the saturation rates familiar from general-purpose benchmarks like MMLU or SWE-Bench.

Under what Harvey calls an all-pass standard (requiring every rubric criterion to pass for a task to count as solved), Claude Opus 4.7 led the field at 7.1 percent. Sonnet 4.6 followed at 5.4 percent, Opus 4.6 at 4.2 percent, GPT-5.5 at 2.1 percent, and Gemini 3.5 Flash at 0.8 percent. The spread between top and bottom is more than 8x, and the absolute pass rate of even the top performer is below 10 percent. By comparison, the same models cluster in the 70 to 90 percent range on most public coding benchmarks.
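As a quick check on the arithmetic, the short sketch below recomputes the top-to-bottom spread from the pass rates quoted above, along with how many tasks the leading model needs per full pass. The dictionary is transcribed from the figures in this article; the variable names are illustrative, not part of Harvey's tooling.

```python
# All-pass rates (percent) as reported in the thread and quoted above.
reported_pass_rates = {
    "Claude Opus 4.7": 7.1,
    "Sonnet 4.6": 5.4,
    "Opus 4.6": 4.2,
    "GPT-5.5": 2.1,
    "Gemini 3.5 Flash": 0.8,
}

top = max(reported_pass_rates.values())
bottom = min(reported_pass_rates.values())

# Ratio between the best and worst reported pass rates.
spread = top / bottom

# How many tasks, on average, per fully passed task for the leader.
tasks_per_pass = 100 / top

print(f"spread: {spread:.1f}x")
print(f"leader solves about 1 in {tasks_per_pass:.0f} tasks")
```

The ratio comes out just under 9x, consistent with "more than 8x", and the leader's rate works out to roughly one fully passed task in fourteen.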

The all-pass standard is the structurally interesting part. Many legal tasks are not partially correct or mostly correct; they require every rubric element to be right, because a missed clause, misinterpreted statute, or omitted citation can render the entire output unusable in practice. Harvey’s evaluation reflects how legal workflows actually consume model output: as binary pass/fail on the full task, not as a percentage of sub-tasks completed.
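To make the difference concrete, here is a minimal sketch of all-pass scoring next to ordinary partial-credit scoring. Harvey has not published its grading code in the material covered here, so the function names, rubric keys, and sample data are all hypothetical; the only thing taken from the source is the rule that a task counts as solved only when every rubric criterion passes.

```python
from typing import Dict, List

# Each task is a mapping of rubric criterion -> pass/fail (hypothetical shape).
TaskResult = Dict[str, bool]


def all_pass_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks where every rubric criterion passed."""
    if not results:
        return 0.0
    # A task with an empty rubric is treated as unsolved (an assumption).
    solved = sum(1 for task in results if task and all(task.values()))
    return solved / len(results)


def partial_credit_rate(results: List[TaskResult]) -> float:
    """Fraction of individual criteria passed, pooled across all tasks."""
    total = sum(len(task) for task in results)
    passed = sum(sum(task.values()) for task in results)
    return passed / total if total else 0.0


# Invented example: four tasks, three criteria each, one miss in three of them.
sample = [
    {"cites_statute": True, "flags_clause": True, "names_party": True},
    {"cites_statute": True, "flags_clause": False, "names_party": True},
    {"cites_statute": True, "flags_clause": True, "names_party": False},
    {"cites_statute": False, "flags_clause": True, "names_party": True},
]

print(all_pass_rate(sample))        # 0.25
print(partial_credit_rate(sample))  # 0.75
```

On this invented data the model gets 75 percent of individual criteria right but solves only 25 percent of tasks, which is the gap the all-pass standard is designed to expose.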

This evaluation also supports a point that has come up from several angles this week: standard benchmarks have lost discriminating power, and domain-specific evaluations that match real workflows are where the signal now lives. The DeepSWE benchmark (also covered today) makes a similar argument for coding. Harvey’s results show the same pattern in a different vertical.

Some structural skepticism is warranted. Harvey is a legal AI company with a commercial incentive to demonstrate that legal work remains hard and that general-purpose AI models cannot displace specialized legal tooling. A benchmark designed and run by Harvey, on tasks selected by Harvey, with a rubric defined by Harvey, is a signal, not independent verification. The 7.1 percent figure for the leading model is more meaningful as a relative ranking than as an absolute capability statement.

For legal tech buyers, the practical implication is that frontier models are not yet a substitute for specialized legal AI tooling or for experienced legal professionals on tasks that require every detail to be correct. The leading model fully passes the rubric on roughly 1 in 14 tasks. That is not yet a pass rate that justifies replacing professional legal work, but it is a meaningful starting baseline for tasks where AI augmentation (rather than replacement) is the deployment model.

The trajectory matters more than the snapshot. If the pass rate moves from 7.1 percent to 20 percent over the next twelve months, the substitution math changes. The Mythos 1 release, expected imminently and likely to be tested against the same benchmark when it lands, will be the next data point worth watching.

Posted by Gabe Pereyra on X on 2026-05-26.