The most consequential decision in a frontier model training run is how to split a fixed compute budget between model size and training tokens. That decision is guided almost entirely by scaling laws, and Anthropic head of safety Lilian Weng published a careful technical audit of them on June 24, 2026, that every team spending on GPU clusters should read.
The core idea is decades old. Training loss falls predictably as you increase model parameters (N), training data (D), or total compute (C), tracing a power-law curve that looks like a straight line on a log-log plot. The practical value is real: run a handful of small experiments, fit the curve, and extrapolate to estimate the token and compute requirements for a model ten or a hundred times larger, without paying the full cost upfront.
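The fit-and-extrapolate workflow can be sketched in a few lines. This is a minimal illustration, assuming a pure power law with no irreducible-loss term; the compute values, the constants 12.0 and 0.05, and the helper names are hypothetical and exist only to generate noise-free synthetic data.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b), done as a straight line in log-log space."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    sxx = sum((u - mean_x) ** 2 for u in lx)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(lx, ly))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)

def predict(a, b, x):
    """Evaluate the fitted power law at x."""
    return a * x ** (-b)

# Hypothetical small runs: compute in FLOPs, losses generated from an assumed curve.
compute = [1e17, 3e17, 1e18, 3e18, 1e19]
loss = [12.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"fitted a={a:.3f}, b={b:.4f}")
print(f"extrapolated loss at 1e21 FLOPs: {predict(a, b, 1e21):.3f}")
```

The extrapolation target here sits two orders of magnitude beyond the largest fitted run, which is the regime where real fits are least trustworthy.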
Two papers define the field. Kaplan et al. (2020) from OpenAI established the foundational power-law relationships and concluded that model size should be scaled considerably faster than dataset size when compute is the constraint. That finding shaped years of training decisions at multiple labs. Hoffmann et al. (2022) at DeepMind then ran a cleaner experiment and reached a markedly different conclusion: given a fixed compute budget, model size and training tokens should scale in roughly equal proportion. Their model, Chinchilla, was trained on far more tokens than same-compute contemporaries and outperformed models several times its parameter count.
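The equal-proportion rule can be turned into a back-of-the-envelope allocator. The sketch below rests on two widely used approximations that are assumptions here, not figures from Weng's essay: training compute of about 6 FLOPs per parameter per token (C ≈ 6ND), and a ratio of about 20 training tokens per parameter. The function name is hypothetical.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0  # assumed approximation: C = 6 * N * D
TOKENS_PER_PARAM = 20.0      # assumed rule-of-thumb ratio D / N

def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) at a fixed tokens-per-parameter ratio."""
    # Substituting D = r * N into C = 6 * N * D gives N = sqrt(C / (6 * r)).
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):
    n, d = compute_optimal_split(budget)
    print(f"C={budget:.0e} FLOPs -> N={n:.2e} params, D={d:.2e} tokens")
```

Because the ratio is held fixed, both N and D grow as the square root of compute: a 100x larger budget buys a 10x larger model trained on 10x more tokens.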
The Chinchilla result has since become the dominant planning heuristic. Labs describe training runs as “compute-optimal” when they follow the implied data-to-parameter ratio from Hoffmann et al. But Weng’s essay identifies where that framing breaks down.
Scaling laws are fit on loss, not on downstream task accuracy or benchmark performance. Loss and capability are correlated but not identical. A model trained to a lower loss does not always outperform a higher-loss model on tasks that matter to operators. The extrapolation assumes the power-law relationship holds at larger scale, but the curve is fit from small runs where the range of C, N, and D is far narrower than the target regime.
The compute-optimal framing also assumes that training compute is the only cost that matters. For teams that plan to run a model at inference for months or years, that assumption is wrong. A smaller model trained on more tokens to reach the same loss costs more to train but less to serve. Once inference cost enters the budget, the optimal training point shifts toward smaller models and longer training. Weng’s essay surfaces this tension without resolving it: the right N-to-D ratio depends on the deployment context, and most published scaling laws were derived without specifying that context.
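One way to see the shift is to add a serving term to the training bill and search for the cheapest model that reaches a fixed loss. The sketch below assumes the parametric loss form from Hoffmann et al., L(N, D) = E + A/N^α + B/D^β, with the constants commonly quoted from that paper, plus two approximations that are assumptions here: about 6 FLOPs per parameter per training token and about 2 FLOPs per parameter per served token. The target loss, the lifetime inference volume, and the function names are hypothetical.

```python
# Assumed parametric fit: L(N, D) = E + A / N**ALPHA + B / D**BETA
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28

def tokens_for_loss(n_params, target_loss):
    """Training tokens a model of n_params needs to hit target_loss, or None if unreachable."""
    gap = target_loss - E - A / n_params ** ALPHA
    if gap <= 0:
        return None  # the model is too small to reach this loss with any amount of data
    return (B / gap) ** (1.0 / BETA)

def cheapest_model(target_loss, inference_tokens, grid_points=400):
    """Grid-search model sizes from 1e9 to 1e12 params; return (total_flops, n_params, n_tokens)."""
    best = None
    for i in range(grid_points + 1):
        n = 10 ** (9 + 3 * i / grid_points)
        d = tokens_for_loss(n, target_loss)
        if d is None:
            continue
        # Training cost (6ND) plus lifetime serving cost (2N per generated or processed token).
        total = 6 * n * d + 2 * n * inference_tokens
        if best is None or total < best[0]:
            best = (total, n, d)
    return best

for served in (0.0, 1e13):
    total, n, d = cheapest_model(target_loss=2.0, inference_tokens=served)
    print(f"served={served:.0e} tokens -> N={n:.2e}, D={d:.2e}, total={total:.2e} FLOPs")
```

Under these assumptions, adding a large serving volume moves the cheapest configuration to a smaller model trained on more tokens. The size of that shift depends entirely on the assumed constants and deployment volume.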
Data constraints are a related problem the field has not solved. Scaling laws assume you can acquire as many tokens as optimal training requires. At the data volumes implied by compute-optimal recipes for frontier-scale runs, that assumption is increasingly false. Labs are training on repeated epochs of their data or on synthetic data, and neither case maps cleanly onto the scaling curves that were fit on single-epoch, high-quality text.
The honest summary of Weng’s position is that scaling laws are useful navigation tools, not physics. They were derived empirically on specific architectures, specific data distributions, and specific compute regimes. When any of those change, the curves shift. The Kaplan-to-Chinchilla reversal happened because the experimental design improved. Another reversal is possible.
For AI builders allocating compute in the next ninety days: if your planning model assumes Chinchilla-optimal ratios without adjusting for inference serving costs, you are likely overbuilding on parameters and underinvesting in training data. Running your own scaling fits on your target task, not just cross-entropy loss, is the most direct way to check whether published curves apply to your workload.
Published by Lilian Weng on Lil’Log on June 24, 2026, at https://lilianweng.github.io/posts/2026-06-24-scaling-laws/.