The tests we built to measure AI capability are running out of ceiling. In a recorded conversation published this week, Tejal Patwardhan, who works on evaluations at OpenAI, laid out the core problem: frontier models are scoring high enough on established benchmarks that the benchmarks no longer tell researchers, builders, or buyers much about which system is actually better or where the real limits are.

Benchmark saturation is not a new observation. The community watched it happen with ImageNet, then with GLUE, then with MMLU. Each time, a benchmark starts as a genuine test of capability, becomes the optimization target, and eventually becomes a ceiling that most competitive models bump against. What is new are the pace and the stakes. When saturation happens at the frontier, where differences between systems translate directly into deployment decisions worth hundreds of millions of dollars, the measurement gap has real costs.

The saturation pattern has a predictable shape. A benchmark is released when the best models score, say, 40 to 60 percent. Researchers treat it as signal. Labs train on data that overlaps with the benchmark distribution, sometimes inadvertently, sometimes deliberately. Within one to two generations of frontier models, scores cluster in the high 80s and 90s. The gap between top systems shrinks to within measurement noise. At that point, a benchmark score tells you almost nothing about which system will perform better on your actual workload.
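
One rough way to check whether a gap between two scores is within noise is to compare it with the sampling error implied by the benchmark's size. The Python sketch below is an illustration under stated assumptions, not a method described in the conversation: it treats each benchmark item as an independent pass/fail trial and the two models' runs as independent samples, and the function names, the 1,000-item benchmark size, and the 1.96 threshold are all chosen for the example.

```python
import math


def accuracy_standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def gap_in_standard_errors(acc_a: float, acc_b: float, n_items: int) -> float:
    """Score gap divided by the standard error of the difference.

    Assumption: the two runs are treated as independent samples. When both
    models answer the same items, a paired comparison would be more precise.
    """
    se_a = accuracy_standard_error(acc_a, n_items)
    se_b = accuracy_standard_error(acc_b, n_items)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) / se_diff


def is_within_noise(acc_a: float, acc_b: float, n_items: int,
                    z_threshold: float = 1.96) -> bool:
    """True when the gap is smaller than roughly a 95% noise band."""
    return gap_in_standard_errors(acc_a, acc_b, n_items) < z_threshold


if __name__ == "__main__":
    # Hypothetical scores on a hypothetical 1,000-item benchmark.
    print(is_within_noise(0.91, 0.92, 1000))  # saturated regime: one-point gap
    print(is_within_noise(0.45, 0.55, 1000))  # early regime: ten-point gap
```

Under these assumptions, a one-point gap near the top of the scale is about 0.8 standard errors wide on 1,000 items, while a ten-point gap in the middle of the scale is more than four. The first cannot be distinguished from sampling noise; the second can.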

Patwardhan’s discussion pointed toward several directions the field is moving in response. The emphasis is shifting toward dynamic and contamination-resistant evaluations: benchmarks that are harder to train into, that involve genuine reasoning over novel problem structures, or that are generated fresh rather than drawn from static held-out sets. There is also growing interest in task-specific evals that are closer to actual deployment conditions, where the question is not “can the model pass a hard test” but “can the model do the job a user needs done, reliably, at scale.”

Forecasting model progress is the harder adjacent problem. If your measurement instrument saturates before the capability it is measuring saturates, you lose the ability to predict where models are heading. This is not just a research problem. Builders making infrastructure and product decisions need to anticipate capability jumps, and a flat benchmark curve that masks real underlying progress is actively misleading. Labs are aware of this, which is part of why OpenAI and others have invested in harder agentic and multi-step evaluation suites, where a model has to complete extended tasks rather than answer discrete questions.

The honest caveat is that every proposed solution to benchmark saturation tends to eventually face the same problem: once it becomes a published standard, it becomes an optimization target. Dynamic evals are harder to game, but they are also harder to compare across labs and over time. Task-specific evals are more meaningful but harder to aggregate into a single number that the industry can reason about. The field does not yet have a clean answer.

For builders, the practical implication runs ahead of the research debate. Standard published benchmark scores are becoming less useful as procurement and architecture signals at the frontier tier. Teams that need to differentiate between top-tier models should invest in building internal evals that match their specific workloads, including failure-mode testing, latency-under-load tests, and multi-turn consistency checks. The labs themselves are pointing in this direction, and Patwardhan’s framing reinforces it.
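
As a starting point, an internal eval can be as small as a list of workload-specific prompts, each paired with a pass/fail check and run several times. The sketch below is a minimal illustration, not a harness described by Patwardhan or any lab: `ModelFn`, `EvalCase`, `run_eval`, and the stub model are hypothetical names invented for the example, and "consistency" here means agreement across repeated single-turn runs, a simpler stand-in for multi-turn checks. Latency-under-load testing is left out.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical interface: any function that maps a prompt to a completion.
ModelFn = Callable[[str], str]


@dataclass
class EvalCase:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # workload-specific pass/fail check
    category: str = "task"         # e.g. "task" or "failure_mode"


def run_eval(model: ModelFn, cases: List[EvalCase],
             repeats: int = 3) -> Dict[str, dict]:
    """Run every case `repeats` times and report pass rate and consistency."""
    report: Dict[str, dict] = {}
    for case in cases:
        outputs = [model(case.prompt) for _ in range(repeats)]
        pass_rate = sum(case.passes(out) for out in outputs) / repeats
        # Consistency: share of runs that match the most common output.
        most_common = max(set(outputs), key=outputs.count)
        consistency = outputs.count(most_common) / repeats
        report[case.name] = {
            "category": case.category,
            "pass_rate": pass_rate,
            "consistency": consistency,
        }
    return report


def stub_model(prompt: str) -> str:
    """Deterministic stand-in for a real model call (assumption for the demo)."""
    if "refund" in prompt:
        return "Refunds are accepted within 30 days."
    return "I don't know."


cases = [
    EvalCase("refund_policy", "What is the refund window?",
             lambda out: "30 days" in out),
    EvalCase("declines_unknown", "What is the CEO's home address?",
             lambda out: "don't know" in out.lower(),
             category="failure_mode"),
]

report = run_eval(stub_model, cases)
for name, row in report.items():
    print(name, row)
```

Replacing `stub_model` with a wrapper around each candidate model lets the same cases be run against several systems, so the comparison reflects the team's own workload instead of a published leaderboard.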

The benchmark you are relying on to pick your next model may be telling you less than you think.

Based on a recorded conversation with Tejal Patwardhan of OpenAI, published June 17, 2026.