Only one model, GPT 5.2, succeeded at a task that should be well within the reach of frontier AI: designing a benchmark that other frontier models actually find hard to solve. Every other model tested, including Claude Opus 4.6 and GPT 5.5, either built problems too easy for their peers or constructed puzzles with no valid solution at all.

The finding comes from BenchBench, introduced by Rohit Krishnan on Strange Loop Canon on May 22. The setup is straightforward: give each model a survey of existing benchmarks, then ask it to invent a new one that is simultaneously novel, practically solvable, and discriminating enough to separate frontier models from each other. Krishnan ran multiple rounds, feeding each model the previous round’s failures so it could learn and iterate.
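The "solvable" and "discriminating" criteria can be made concrete by looking at how a panel of solver models scores on a candidate benchmark. The sketch below is illustrative only: the function name, the score format, and both thresholds are assumptions, not Krishnan’s actual scoring rules, and novelty is not something it tries to measure.

```python
from statistics import pstdev


def classify_benchmark(solver_scores, easy_threshold=0.9, min_spread=0.15):
    """Classify a candidate benchmark from per-solver accuracy in [0, 1].

    Hypothetical heuristic: the labels and thresholds are assumptions
    chosen for illustration, not BenchBench's real criteria.
    """
    if not solver_scores:
        raise ValueError("need at least one solver score")
    scores = list(solver_scores.values())
    if max(scores) == 0.0:
        return "unsolvable"          # no solver got anything right
    if min(scores) >= easy_threshold:
        return "too_easy"            # every solver clears the bar
    if pstdev(scores) < min_spread:
        return "not_discriminating"  # solvers are bunched together
    return "discriminating"


# Placeholder solver names, not real model results.
print(classify_benchmark({"solver_a": 0.9, "solver_b": 0.2}))
```

The first two labels correspond to the two failure modes described above: problems too easy for peers, and puzzles with no valid solution.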

The results exposed a gap between Solver ability and Creator ability that standard leaderboards do not capture. The models that rank highest on almost every existing benchmark performed worst as benchmark designers. Claude Opus 4.6 produced clean, elegant problems that were easy to solve precisely because they were well-structured. GPT 5.5 gravitated toward procedural rule tasks that leaned too heavily on exact schemas. GPT 5.4 built policy scenarios that collapsed into checklists, yet it was notably the best Solver of the other models’ benchmarks.

GPT 5.2 succeeded with a contribution Krishnan calls Reimbursement Forensics: given a set of travel expense packets with voided receipts and duplicates, produce a single reimbursable total in cents. The task is messy, specific, and requires navigating real-world ambiguity rather than applying a clean algorithm. That design choice is what separated it from the field.
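To give a feel for the shape of such a task, here is a minimal sketch of a reimbursement total computed in integer cents. Everything specific in it is an assumption made for illustration: the `Receipt` fields, the rule that a duplicate is a receipt with the same vendor, date, and amount, and the sample data. The actual Reimbursement Forensics rules are not spelled out here and are presumably messier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Receipt:
    receipt_id: str
    vendor: str
    date: str            # ISO date string, e.g. "2026-03-14"
    amount_cents: int    # integer cents avoid floating-point rounding
    voided: bool = False


def reimbursable_total_cents(packets):
    """Sum non-voided receipts across packets, counting duplicates once.

    Assumed duplicate rule: same vendor, date, and amount.
    """
    seen = set()
    total = 0
    for packet in packets:
        for receipt in packet:
            if receipt.voided:
                continue  # voided receipts are never reimbursed
            key = (receipt.vendor, receipt.date, receipt.amount_cents)
            if key in seen:
                continue  # already counted under the assumed rule
            seen.add(key)
            total += receipt.amount_cents
    return total


# Hypothetical sample data.
packets = [
    [Receipt("r1", "Hotel", "2026-03-14", 18950),
     Receipt("r2", "Taxi", "2026-03-14", 2300, voided=True)],
    [Receipt("r3", "Hotel", "2026-03-14", 18950),   # duplicate of r1
     Receipt("r4", "Cafe", "2026-03-15", 1275)],
]
print(reimbursable_total_cents(packets))  # 20225
```

A clean rule like this one is exactly what the real task avoids relying on: the difficulty comes from cases where it is ambiguous whether two receipts are duplicates or whether a void applies.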

What BenchBench appears to measure is a form of theory of mind applied to other models. To design a discriminating benchmark, a model must reason about what its peers know, where they are likely to fail, and how to exploit those gaps without making the task unsolvable. That is a different cognitive operation from answering questions, and the results suggest the two capabilities do not reliably travel together.

Gemini 3.1 Pro stands out in the data as the most inventive Creator in Krishnan’s tests. It produced spatial traversal tasks and lease reconciliation problems that separated solvers better than most. Krishnan’s assessment is that Google has a genuinely strong model that does not receive the attention it merits. Gemini 3.5 Flash was a better Creator than Claude Opus 4.6 despite ranking below it on standard benchmarks.

The structural caveat here is significant. BenchBench is one researcher’s experiment, tested across a limited model set, with rounds designed and judged by the same author. It produces a clear signal, but not proof. Krishnan himself acknowledges the benchmark needs to be run at scale with more models before its rankings should be treated as settled. The result is a hypothesis worth testing, not a new leaderboard to trust outright.

The benchmark-saturation context makes the hypothesis worth taking seriously. Frontier models now saturate most standard evaluations within months of their release. The practical question for eval design has shifted from whether a model can answer hard questions to whether it can generate hard questions that others cannot answer. BenchBench addresses that directly, and the fact that its Creator rankings diverge so sharply from standard leaderboards suggests it is measuring something real.

Teams selecting models for roles that require generating novel tests, synthetic training data, or adversarial evaluation sets should not assume that the highest-ranked benchmark solver will also be the best benchmark designer. Krishnan’s data suggests those are separate capabilities worth evaluating independently before committing to a model for creative or strategic generation work.

Posted by Rohit Krishnan on Strange Loop Canon on 2026-05-22.