Benchmarks are lying to your procurement team

Noam Brown argues that plotting LLM performance against test-time compute exposes capability gaps that single-scalar release scores bury entirely.

Alessandro Benigni

PUBLISHED JUN 11, 2026

The way the industry reports model benchmarks is now actively misleading buyers. That is the core claim Noam Brown, co-creator of OpenAI o1, made in a thread posted June 9 on X, and the argument deserves more attention than it received.

Brown’s point is mechanical rather than political. LLM performance is no longer a function of model identity alone. It is a function of how much compute is spent at inference time. When you compare GPT-5.5 against GPT-5.4 at maximum compute and report a single number, you get near-parity. When you plot both models against tokens spent on a task as the x-axis, GPT-5.5 is substantially stronger at every fixed budget. The models are not close. The reporting method made them look close.

The UK AI Security Institute’s cyber evaluation of GPT-5.5 illustrates the problem concretely. At the hardest tier of a 95-task benchmark, GPT-5.5 lands near-parity with Claude Mythos Preview on max-compute figures. That sounds like a coin flip between two models. The same data plotted against compute budget tells a different story: at constrained inference spend, GPT-5.5 outperforms GPT-5.4 by margins the single-bar chart cannot surface. UK AISI also found no performance plateau on the hardest tasks when inference compute scales up, a finding that has held since basic tasks saturated in February 2026.

Brown’s methodological recommendation is specific: benchmark developers should produce performance-versus-compute curves, not one bar per model. Tokens, cost, and latency are all valid x-axes. The single-scalar score belongs to an era when inference was roughly fixed per query. That era ended when reasoning models made extended chain-of-thought the variable you tune for the task.

This argument sits alongside the case Yoonho Lee made for update-time compute as the optimization substrate for capability. Lee’s frame is that capability accumulates at training time through iterated text-based updates. Brown’s frame is that deployed capability only becomes legible when inference compute is the lens. Both arguments reach the same structural conclusion: current benchmark reporting flattens a multi-dimensional reality into one number, and with each release cycle that number conceals more than it reveals.

The procurement implication is immediate and concrete. Enterprise teams currently comparing Claude Fable 5, GPT-5.5, and Gemini 3.5 are working from benchmark tables that Brown’s analysis says are methodologically broken for their actual use case. The table answers the question “which model scores higher at max compute?” Most buyers are not deploying at max compute. They are deploying at a cost-per-call budget, a latency constraint, or a token cap set by their product team. The model that wins the benchmark table may not be the model that wins at their specific operating point on the compute curve.
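To make the operating-point idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the model names, the token budgets, and the scores are invented placeholders, not figures from Brown's thread or from any published evaluation, and linear interpolation between measured points is just one simple way to read a curve at a budget you did not measure directly.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple

Curve = List[Tuple[int, float]]  # (token budget, score), sorted by budget

# Invented placeholder curves, NOT real benchmark results.
curves: Dict[str, Curve] = {
    "model-a": [(1000, 0.20), (4000, 0.50), (16000, 0.90)],
    "model-b": [(1000, 0.40), (4000, 0.70), (16000, 0.88)],
}

def score_at(curve: Curve, budget: int) -> float:
    """Read a curve at `budget`, interpolating linearly between measured points.

    Budgets outside the measured range are clamped to the nearest endpoint.
    """
    xs = [b for b, _ in curve]
    ys = [s for _, s in curve]
    if budget <= xs[0]:
        return ys[0]
    if budget >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, budget)  # xs[i-1] <= budget < xs[i]
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (budget - x0) / (x1 - x0)

def best_at(all_curves: Dict[str, Curve], budget: int) -> str:
    """Return the model with the highest interpolated score at `budget`."""
    return max(all_curves, key=lambda name: score_at(all_curves[name], budget))

leaderboard_winner = best_at(curves, 16000)  # the max-compute comparison
production_winner = best_at(curves, 2500)    # a hypothetical production token cap
```

With these placeholder numbers the leaderboard winner and the production winner are different models, which is the situation the paragraph above describes. Cost per call or latency could replace tokens on the x-axis without changing the logic.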

Brown is addressing benchmark developers directly, but the people who need to hear it are the buyers and the engineers who evaluate models before contract renewals. An eval that does not vary inference compute now produces results that systematically under-inform the decision. The fix is not technically complex: run the candidate models at three to five compute budgets that bracket your production setting, then compare performance across that range rather than at a single maximum point.
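A budget sweep of that kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real harness: `run_task` is a hypothetical callable you would supply to call your model with a token cap and grade the answer, and `fake_run_task` below is a deterministic stand-in with made-up behavior so the sketch runs without any API access.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Curve = List[Tuple[int, float]]  # (token budget, pass rate)

def sweep_budgets(
    models: Sequence[str],
    budgets: Sequence[int],
    tasks: Sequence[str],
    run_task: Callable[[str, str, int], bool],
) -> Dict[str, Curve]:
    """Score every model at every budget and return one pass-rate curve per model."""
    curves: Dict[str, Curve] = {}
    for model in models:
        curve: Curve = []
        for budget in sorted(budgets):
            passed = sum(run_task(model, task, budget) for task in tasks)
            curve.append((budget, passed / len(tasks)))
        curves[model] = curve
    return curves

def fake_run_task(model: str, task: str, budget: int) -> bool:
    """Stand-in for a real model call (hypothetical behavior, invented numbers).

    A task named "t3" has difficulty 3; a model passes it when the budget
    covers difficulty times that model's per-unit token need.
    """
    tokens_per_unit = {"model-a": 4000, "model-b": 1000}[model]
    difficulty = int(task[1:])
    return budget >= tokens_per_unit * difficulty

# Three budgets bracketing a hypothetical production cap of 4,000 tokens.
curves = sweep_budgets(
    models=["model-a", "model-b"],
    budgets=[1000, 4000, 16000],
    tasks=["t1", "t2", "t3", "t4"],
    run_task=fake_run_task,
)

for name, curve in curves.items():
    print(name, curve)
```

In this toy setup both placeholder models reach the same pass rate at the largest budget and diverge sharply at the middle one, so a single max-compute number would hide the difference. In a real evaluation, `run_task` would wrap your provider's API and your own grading logic, and the budget could be a cost or latency limit instead of a token cap.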

The single-scalar benchmark will persist because it produces a clean leaderboard and clean leaderboards generate press coverage. That is not a methodological argument; it is an incentive structure. Brown is asking the field to accept a chart that is harder to headline in exchange for information that is actually useful. The buyers who run their own curves before the next contract cycle will make better decisions than the ones who wait for the field to reform itself.

Any team currently running model evaluations ahead of a 2026 enterprise renewal should treat performance at their production compute budget as the primary metric, and any vendor benchmark score that lacks a compute axis as a starting point for investigation rather than a final answer.

Noam Brown on X (OpenAI o1 co-creator), posted June 9, 2026.
