The Third Scaling Regime Nobody Is Measuring

A Stanford researcher argues that prompt, context, and harness optimization is sample-efficient learning with its own compute curve, and the field is ignoring it.

Alessandro Benigni

PUBLISHED JUN 11, 2026


The ML research community has spent a decade arguing about where learning happens. Yoonho Lee, a researcher at Stanford, published a blog post on June 9 arguing the community has been asking the wrong question.

Lee’s core claim is that optimizing the text layer around a model (prompts, context windows, retrieval indices, memory stores, agent harnesses) is not prompt engineering in the colloquial, low-status sense. It is a legitimate update mechanism, one that changes system behavior in response to new information just as gradient descent does. The difference is efficiency. A single well-constructed prompt revision propagates across every subsequent call. The equivalent behavioral change via supervised fine-tuning would require far more labeled examples to reach the same result.

This matters because the field has organized its understanding of capability progress around two compute axes. Train-time compute covers model size and training tokens; the Chinchilla-style scaling laws live here. Test-time compute covers chain-of-thought, reasoning budgets, and inference-time search; the o1 and DeepSeek-R1 generation of models made this axis legible. Lee argues for a third: update-time compute, measured by how much compute a system spends revising its text layer between deployments. This compute does not appear in any published scaling curve.
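To make the idea of update-time compute concrete, here is a minimal Python sketch, not Lee's method. It assumes the simplest possible setup: a frozen model, a small labeled dev set, and a greedy search over candidate prompt revisions, with the number of model calls serving as a rough proxy for compute spent. The names `optimize_text_layer`, `UpdateResult`, and `toy_model` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class UpdateResult:
    prompt: str        # best text-layer candidate found
    score: float       # its accuracy on the dev set
    model_calls: int   # rough proxy for update-time compute spent


def optimize_text_layer(
    base_prompt: str,
    revisions: Iterable[str],
    dev_set: list[tuple[str, str]],
    run_model: Callable[[str, str], str],
) -> UpdateResult:
    """Greedy search over prompt revisions; the model itself never changes."""
    calls = 0

    def evaluate(prompt: str) -> float:
        nonlocal calls
        correct = 0
        for question, expected in dev_set:
            calls += 1  # every evaluation call counts toward the budget
            if run_model(prompt, question) == expected:
                correct += 1
        return correct / len(dev_set)

    best_prompt, best_score = base_prompt, evaluate(base_prompt)
    for candidate in revisions:
        score = evaluate(candidate)
        if score > best_score:  # keep a revision only if it strictly improves
            best_prompt, best_score = candidate, score
    return UpdateResult(best_prompt, best_score, calls)


def toy_model(prompt: str, question: str) -> str:
    """Hypothetical stand-in for a frozen model: upper-cases only when told to."""
    return question.upper() if "uppercase" in prompt.lower() else question


dev = [("abc", "ABC"), ("xyz", "XYZ")]
result = optimize_text_layer(
    "Answer the question.",
    ["Answer briefly.", "Answer in uppercase."],
    dev,
    toy_model,
)
print(result)
```

In this toy setup the behavior change comes entirely from the revised prompt, and `model_calls` is the quantity that would sit on the x-axis of an update-time scaling curve. A real system would presumably count tokens or FLOPs rather than calls.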

The framing connects directly to what is showing up empirically across the industry this week. A DX study found 8 percent throughput gains on pull requests from AI tooling. A Perplexity research piece documented 87 percent reductions in task completion time on certain knowledge workflows. The K-Dense argument circulating in developer circles positions the workflow itself, not the model, as the durable competitive moat. Lee is providing the theoretical frame underneath all of these findings: the gap between what a model is capable of in principle and what a deployed system actually delivers is closed not by training a better model, but by optimizing the text layer around the model you already have.

He is explicit that this view is contested. The strongest counterargument is amortization: training a behavior into weights means the system does not have to carry its specification in every context window. Lee accepts this for stable, broadly useful behaviors. His rebuttal is that many production behaviors are volatile, user-specific, or not yet trusted enough to lock into weights. The text layer is, in his framing, a “staging ground” where behavioral hypotheses can be tested before committing them to the model. The companies doing this most visibly (Anthropic with its constitutional approach, Cursor with its benchmark-driven harness, Harvey with its domain-specific context layers) confirm the pattern is real at scale.

The piece is positioned as a research agenda, not a settled empirical result. Lee acknowledges that text optimization is methodologically immature: the barrier to tinkering is low, which makes bad science common, and the field lacks rigorous benchmarks that isolate text-layer contribution while controlling for model capability. He names CL-bench and TerminalBench-2 as early attempts, but notes the evaluation infrastructure is nowhere near what exists for weight-based learning.

The observability tools that teams are already buying (Braintrust, LangSmith, Helicone) are functioning, in Lee’s framing, as de facto update-time-compute infrastructure. They are measuring prompt iteration cycles, trace analysis, and harness performance. Nobody is calling this a scaling axis yet. Lee is saying it should be.

The practical implication for operators is narrow but concrete. If update-time compute scales and the compute budgets currently allocated to text-layer iteration are, as Lee puts it, “orders of magnitude smaller” than weight post-training budgets, then teams that invest systematically in harness and context optimization now are compounding an advantage the industry has not yet priced. For any team treating prompt work as a maintenance task rather than a research program, Lee’s post is the case for reconsidering that allocation before the rest of the field formalizes the curve.

Based on a blog post by Yoonho Lee (yoonholee.com), published June 9, 2026.
