The reward problem: why RL stalls outside math and code

Reinforcement learning has cracked coding and math by exploiting clean, checkable answers. Unlocking the rest of the economy requires solving the verification problem first.

Alessandro Benigni

PUBLISHED JUL 1, 2026

Dario Amodei, CEO of Anthropic, says he is 90 percent confident AI will reach the capability of “a country of geniuses in a data center” within ten years. His residual uncertainty, he told host Dwarkesh Patel on a recent podcast, comes down to tasks that cannot be verified: planning a Mars mission, discovering the next CRISPR, writing a novel. That single caveat marks out the frontier for the next generation of AI training.

The structural reason for the gap is what Jason Wei, then at OpenAI, called the “verifier’s law”: how easily you can train an AI system to do a task is roughly proportional to how checkable the answer is. Math proofs either follow or do not. Code either passes tests or fails. Both properties make RL with verifiable rewards (RLVR) a compounding engine, because every attempt can be graded automatically. OpenAI and Google DeepMind each scored 35 out of 42 at the 2025 International Math Olympiad, gold-medal level, on problems most strong undergraduates cannot touch. SWE-bench coding scores have climbed in parallel. The reward signal is cheap and clean, and it can be computed millions of times.
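To make "checkable" concrete, here is a minimal sketch of a verifiable reward for a coding task. It is an illustration, not any lab's actual training setup: the task, the test cases, and the names `verifiable_reward`, `good_attempt`, and `bad_attempt` are all hypothetical.

```python
from typing import Callable, Sequence, Tuple

# Each case pairs the input arguments with the expected output.
Case = Tuple[Tuple[int, int], int]


def verifiable_reward(candidate: Callable[[int, int], int],
                      cases: Sequence[Case]) -> float:
    """Return 1.0 only if the candidate passes every test case, else 0.0."""
    for args, expected in cases:
        try:
            if candidate(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed check, not as a grader error.
            return 0.0
    return 1.0


# Hypothetical task: "write a function that adds two integers".
CASES = [((1, 2), 3), ((-4, 4), 0), ((10, 5), 15)]


def good_attempt(a: int, b: int) -> int:
    return a + b


def bad_attempt(a: int, b: int) -> int:
    return a - b


print(verifiable_reward(good_attempt, CASES))  # 1.0
print(verifiable_reward(bad_attempt, CASES))   # 0.0
```

The grader needs no human and no judgment call, which is why it can be run on every attempt. A memo or a business strategy offers no equivalent of `CASES`.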

Tanay Jaipuria, a partner at Wing VC whose analysis this piece is based on, frames the limitation precisely: most economically valuable work has no test suite. A good memo, a sound business strategy, or a new drug candidate cannot be scored in milliseconds. RLHF and Anthropic’s Constitutional AI both tackled an earlier, alignment-focused version of this problem, but Jaipuria argues they optimized for engagement and safety rather than capability uplift. The gains RLVR delivered for coding have not materialized for writing, strategic reasoning, or scientific discovery.

Three technical approaches are now trying to close that gap. The first is rubric-based rewards. Scale AI published a paper in mid-2025 describing per-prompt checklists anchored to domain experts: instead of asking an LLM judge “is this good,” you ask “does it address X, avoid Y, cover Z.” Each sub-question is close to binary, which makes the aggregate score tractable as a reward signal. Scale reported up to a 31 percent relative gain on HealthBench, a medical benchmark, over plain judge scoring. Follow-on work called OpenRubrics focuses on generating these checklists at scale. The second approach is generative reward models, where the judge reasons before scoring rather than outputting a number directly. The third is process reward models, which score intermediate reasoning steps rather than only the final answer. That distinction matters more as tasks grow longer and less crisp.

The common logic across all three: when you cannot build a programmatic checker, you decompose the judgment into smaller questions that approximate one.
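The decomposition can be sketched in a few lines. This is a simplified illustration, not the method from the Scale AI paper: the rubric items, the weights, and the keyword checks are hypothetical stand-ins. In a real pipeline each `check` would be a yes/no question put to an LLM judge or a domain expert.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    description: str
    check: Callable[[str], bool]  # stand-in for a judge's yes/no answer
    weight: float = 1.0


def rubric_reward(response: str, rubric: List[RubricItem]) -> float:
    """Weighted fraction of rubric items the response satisfies, in [0, 1]."""
    total = sum(item.weight for item in rubric)
    if total == 0:
        return 0.0
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


# Hypothetical rubric for a medical-advice prompt: address X, avoid Y, cover Z.
RUBRIC = [
    RubricItem("Recommends seeing a clinician",
               lambda r: "doctor" in r.lower(), weight=2.0),
    RubricItem("Avoids naming a specific dosage",
               lambda r: "mg" not in r.lower()),
    RubricItem("Mentions warning signs",
               lambda r: "warning sign" in r.lower()),
]

answer = "See a doctor if symptoms persist, and watch for warning signs such as fever."
print(rubric_reward(answer, RUBRIC))  # 1.0
```

Each item is nearly binary on its own, so the weighted sum behaves more like a test suite than a single "is this good" score does. How well it works depends entirely on how well the checklist is written.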

On the commercial side, Jaipuria identifies three distinct strategies. The first cluster sells verifiers and training data to labs: companies like Mercor, Surge, and Micro1 use expert humans to write rubrics concrete enough to score at scale, while Taste Labs is targeting design and aesthetics, areas RLHF has historically flattened by averaging everyone’s preferences into blandness. A second cluster is formalizing domains: Pramaana Labs is applying formal verification methods to tax, law, and healthcare, so that answers can be checked by machine the way Lean checks a proof. DeepMind’s AlphaProof does the same for advanced mathematics. A third cluster owns the physical loop entirely. Periodic Labs, founded by ex-OpenAI and DeepMind researchers, runs robotic labs to discover new materials. Isomorphic Labs, the DeepMind drug-discovery spinout, grounds predictions in wet-lab and clinical results. Lila Sciences builds autonomous labs across life and materials science. When the verification is a physical experiment, the reward is real but slow and expensive; owning the lab means you control the feedback loop.

What this means in practice: the companies best positioned to deploy AI in healthcare, law, finance, and scientific discovery are not just those with the largest models. They are those that can instrument their domains well enough to generate training signal. Building that instrumentation, whether via rubrics, formal verification, or physical labs, is itself a defensible position. Teams evaluating AI adoption in regulated industries should ask not only which model to use, but whether they can construct a verifier for their highest-value tasks. Without one, capability gains may plateau at the same ceiling that has limited RLHF for years.

Analysis by Tanay Jaipuria, published June 30, 2026, at tanayj.com.
