RLVR Has a Wall. Most Labs Are Pretending It Does Not.

The bet that scaling reinforcement learning from verifiable rewards reaches AGI runs into a hard constraint: most skills worth having cannot be simulated in a datacenter.

Alessandro Benigni

PUBLISHED JUN 30, 2026

The dominant research strategy at every major AI lab is a wager: train models on millions of verifiable tasks across thousands of sandboxed environments, and you get a general intelligence. Dwarkesh Patel, writing on his blog on June 19, 2026, argues this bet is more fragile than its advocates admit.

The case for RLVR (reinforcement learning from verifiable rewards) is straightforward. Coding and math improved dramatically once labs could run thousands of parallel agents against deterministic test suites and score every rollout automatically. The optimists say everything eventually becomes a coding or math problem with enough clever prompt engineering. Scale the environments, scale the compute, generalize.
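To make "verifiable" concrete, here is a minimal sketch of an automatic scorer for coding rollouts. The task, the `TEST_CASES` list, and the `score_rollout` function are hypothetical illustrations, not any lab's actual harness, and a real setup would run candidates in an isolated sandbox rather than with a bare `exec`.

```python
from typing import Callable

# Hypothetical toy task: the model must write a function named `add`.
TEST_CASES = [((1, 2), 3), ((0, 0), 0), ((-4, 9), 5)]


def score_rollout(candidate_source: str) -> float:
    """Return 1.0 if the candidate passes every test, else 0.0."""
    namespace: dict = {}
    try:
        # Illustration only: exec is NOT a sandbox.
        exec(candidate_source, namespace)
        fn: Callable = namespace["add"]
        passed = all(fn(*args) == expected for args, expected in TEST_CASES)
    except Exception:
        # Syntax errors, crashes, and missing definitions all score zero.
        return 0.0
    return 1.0 if passed else 0.0


# Three hypothetical rollouts: correct, wrong, and syntactically broken.
rollouts = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
    "def add(a, b) return a + b",
]
rewards = [score_rollout(src) for src in rollouts]
```

The reward needs no human judgment and the same test suite can be replayed against any number of parallel rollouts, which is the combination the rest of the argument turns on.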

Patel uses computer use as a counterexample. Ordering something on Etsy is verifiable: either the item was purchased or it was not, a simple binary outcome. But you cannot spin up a thousand simultaneous checkout attempts against Amazon.com. Andy Jassy will block the bots. To get a grindable training target, you would have to clone every major web application at high fidelity, which is labor-intensive, slow, and currently unscalable.

The computer-use slowdown reveals a structural constraint. For RLVR to work well, a domain needs two properties simultaneously: it must be verifiable, and it must be replayable at scale with a deterministic simulator. Coding passes both tests. Politics does not. Winning a court case does not. Building a company from scratch does not. These involve interacting with a world that is non-stationary, reset-free, and measured over months or years, not seconds.

Patel is clear that this is not a new observation in the RL literature. Reset-free, non-stationary environments have been a known open problem for years. What he is doing is mapping that academic constraint directly onto the AGI timeline. If the skills that define human expertise in high-stakes domains cannot be trained via RLVR, then the labs’ current playbook hits a ceiling well before AGI.

The second problem is memory. Patel argues that in-context learning, while improving, is not a substitute for continual learning. A model that processes a week of work in context and then resets has learned nothing durable. All those signals, every mistake, every correction, every piece of organization-specific tacit knowledge, vanish. Roughly 30 to 50 percent of lab compute goes to inference, and that inference compute is currently producing no persistent improvement in the underlying model. The value is discarded.

Gradient updates are the obvious fix. Write the learning back into the weights. But gradient updates are sample-inefficient: you need millions of examples of the same thing before a useful signal accumulates, which is why Cursor’s Tab model can online-learn only one very specific objective (which edits got accepted) across 400 million requests per day. Personalizing or specializing a model’s weights from a single deployment session is not something anyone has shipped at meaningful scale.

Patel proposes two candidate paths. One is on-policy self-distillation (OPSD): after a long session, train the base model to match the predictions of the session-informed version. The session-informed model acts as the teacher; the base model is the student. Unlike naive supervised fine-tuning, which would have the model memorize every token of every transcript, OPSD concentrates the gradient update on what actually changed the outcomes. Unlike naive RL, it does not require an external verifiable reward at all.
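A toy calculation shows why a distillation loss of this kind concentrates on what the session changed. This is a sketch under stated assumptions, not Patel's formulation: the logits are made up, the vocabulary has three tokens, and the per-token loss is assumed to be the KL divergence from the student's distribution to the teacher's, which is one common choice for on-policy distillation.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same tokens."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical next-token logits at three positions of a continuation
# sampled from the student (the base model without session context).
student_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]

# Teacher: the same model conditioned on the session. By construction it
# disagrees with the student only at position 1.
teacher_logits = [[2.0, 0.0, 0.0], [0.0, -1.0, 3.0], [1.0, 1.0, 0.0]]

# Assumed loss: KL(student || teacher), computed per position.
per_token_loss = [
    kl(softmax(s), softmax(t)) for s, t in zip(student_logits, teacher_logits)
]
```

Positions 0 and 2 contribute zero loss because the two distributions are identical there, so only position 1 would produce a gradient. A supervised fine-tuning loss on the transcript would instead penalize every token the student did not already predict with certainty.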

The second idea is speculative: “dreaming,” where a model builds its own internal simulation of the domain it is working in and rehearses skills against that simulation before deployment. Patel points to EfficientZero, a model that reached human-level Atari performance with only about two hours of real gameplay by playing many more simulated games in its head than real ones. Generalizing that to a full world simulation is, he acknowledges, a much harder problem.

For teams betting their roadmap on RL fine-tuning today, the practical implication is sharper than it might appear: RLVR is a strong foundation for domains where you can build a test suite, but if your agent’s value proposition is judgment in open-ended, real-world contexts, you are building on a method that has no training signal for the hardest cases your users will present.

Argued by Dwarkesh Patel on June 19, 2026, on his blog at dwarkesh.com.
