By 2031, the largest models labs can feasibly serve may reach 1.4 quadrillion total parameters. That figure comes from a June 22 LessWrong analysis by Vladimir Nesov, who works backward from memory bandwidth physics to project what AI hardware can realistically support each year. The number is striking. What matters more is the constraint behind it, and where it breaks down.
Nesov’s core argument starts with high-bandwidth memory. Reading model weights and key-value cache from HBM takes time, and a target of roughly 80 tokens per second (achievable with speculative decoding) sets a ceiling on how many pipeline stages you can chain together. More stages mean more total parameters you can serve, but only up to the point where you are reading more than half of available HBM in a single forward pass. That physics bound, not algorithmic ambition, determines the maximum model size in any given year.
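To make the bandwidth argument concrete, here is a minimal Python sketch of the stage-count ceiling. It is an illustration under stated assumptions, not Nesov's actual calculation. The function name, the figure of four tokens accepted per forward pass under speculative decoding, and the H100 numbers in the example (about 3.35 TB/s of HBM bandwidth and 80 GB of capacity per chip) are all brought in for illustration. Only the 80 tokens per second target and the one-half HBM read fraction come from the text above.

```python
def max_pipeline_stages(
    hbm_bandwidth_bytes_per_s: float,
    hbm_capacity_bytes: float,
    target_tokens_per_s: float,
    tokens_per_forward_pass: float,
    hbm_fraction_read_per_stage: float,
) -> float:
    """Rough ceiling on sequential pipeline stages for one forward pass.

    Assumes each stage's time is dominated by reading its share of HBM
    (weights plus KV cache) and that stages run one after another.
    """
    # Time available for one full forward pass through every stage.
    # Speculative decoding (ASSUMPTION) lets one pass yield several tokens.
    pass_budget_s = tokens_per_forward_pass / target_tokens_per_s

    # Time one stage spends streaming its portion of HBM.
    bytes_read_per_stage = hbm_fraction_read_per_stage * hbm_capacity_bytes
    stage_read_s = bytes_read_per_stage / hbm_bandwidth_bytes_per_s

    return pass_budget_s / stage_read_s


# Illustrative inputs only: H100-class memory figures, hypothetical
# speculative-decoding yield of 4 tokens per pass.
stages = max_pipeline_stages(
    hbm_bandwidth_bytes_per_s=3.35e12,
    hbm_capacity_bytes=80e9,
    target_tokens_per_s=80,
    tokens_per_forward_pass=4,
    hbm_fraction_read_per_stage=0.5,
)
print(f"Stage ceiling under these assumptions: {stages:.2f}")
```

Under these illustrative inputs the budget allows roughly four stages. Faster memory or more tokens accepted per pass raise the ceiling, and reading a larger share of HBM per stage lowers it.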
The projected maximums are not abstractions. Nesov estimates 1.3 trillion parameters on 2023-era H100 servers, 27 trillion on 2025-era GB200 Oberon racks, and 442 trillion on 2028-era Rubin Ultra Kyber racks. These are the author’s estimates with explicitly stated assumptions, including roughly 40 percent FLOP/s utilization during three months of pretraining and sparsity ratios that climb from 8x in earlier years to 30x by 2028. Labs that are modeling their own scaling projections should stress-test those assumptions, not take the numbers as given.
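The relationship between headline and active parameters can be sketched in a few lines of Python. This assumes "sparsity ratio" means total parameters divided by the parameters active per token, which is a common reading but is not spelled out in the text above. The function name is hypothetical. The example pairs the 2028 figures quoted above: 442 trillion total parameters and a 30x sparsity ratio.

```python
def active_parameters(total_parameters: float, sparsity_ratio: float) -> float:
    """Parameters active per token, ASSUMING sparsity = total / active."""
    if sparsity_ratio <= 0:
        raise ValueError("sparsity_ratio must be positive")
    return total_parameters / sparsity_ratio


# 2028 figures from the text: 442 trillion total, 30x sparsity.
active_2028 = active_parameters(442e12, 30)
print(f"{active_2028 / 1e12:.1f} trillion active parameters")
```

On that reading, the 2028 headline figure corresponds to just under 15 trillion active parameters, which is why the sparsity assumption deserves as much scrutiny as the total.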
The more significant finding is what happens in 2028. Before that year, Nesov argues, inference serving capacity is the binding constraint: you could pretrain a bigger model, but you could not serve it efficiently. Starting in 2028, that flips. Pretraining compute becomes the limiting factor. The hardware can serve models far larger than anyone will actually train, because training them first requires compute resources that are not projected to be available at the needed scale.
There is a second constraint stacking on top of compute. Nesov estimates roughly 200 trillion unique pretraining tokens exist across all sources. A 2031-scale model, at the parameter counts his analysis projects, would need to train for close to four epochs of repeated data to reach compute-optimal coverage. By his estimate, the 2027 model already needs 32 percent more active parameters than a compute-optimal model on fresh data would require, because of a 1.75x data shortfall. By 2031, the active parameter count needs to be four times what unlimited data would predict. Labs are not scaling into abundance. They are scaling to compensate for scarcity.
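The arithmetic here can be checked with a short Python sketch. Two caveats apply. First, the 800 trillion training-token figure in the example is a hypothetical input chosen to correspond to four passes over 200 trillion unique tokens; it is not a number from the source. Second, the square-root rule is an observation about the quoted 2027 figures (the square root of 1.75 is about 1.32), not a derivation taken from Nesov's post, and the function names are made up for illustration.

```python
import math

UNIQUE_TOKENS = 200e12  # estimate of unique pretraining tokens quoted in the text


def epochs_needed(training_tokens: float, unique_tokens: float = UNIQUE_TOKENS) -> float:
    """Number of passes over the unique data needed to reach a token budget."""
    return training_tokens / unique_tokens


def param_multiplier_sqrt_rule(data_shortfall: float) -> float:
    """ASSUMPTION: extra active parameters scale as sqrt(data shortfall).

    This matches the 2027 pair quoted in the text (1.75x shortfall,
    about 32 percent more parameters) but is not stated by the source.
    """
    return math.sqrt(data_shortfall)


# Hypothetical token budget equal to four passes over the unique data.
print(f"Epochs: {epochs_needed(800e12):.1f}")

# 2027 case from the text: a 1.75x data shortfall.
extra = param_multiplier_sqrt_rule(1.75) - 1.0
print(f"Extra active parameters: {extra:.0%}")
```

The square-root reading reproduces the 32 percent figure for 2027. Whether the same rule underlies the fourfold figure for 2031 is not something the text above settles, so the sketch does not extrapolate to it.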
What this means for builders and buyers is concrete. “Bigger every year” has been a reliable heuristic for roughly a decade. Nesov’s analysis suggests it holds through 2027 on the inference axis, and then the constraint shifts to pretraining economics and data availability. An announcement of a dramatically larger model in 2028 or 2029 is not necessarily evidence that the hardware constraint has been solved. It may mean the lab has access to outsized pretraining compute, or it may mean the model’s effective parameter count (active at inference) is smaller than the headline figure suggests, with sparsity doing the work.
Roadmap claims from any lab should be read against the generation of infrastructure that lab has actually deployed. A model announced for 2028 serving on Rubin Ultra Kyber-class hardware faces a different feasibility envelope than one intended for systems shipping a year earlier. Nesov’s framework gives buyers a way to ask the right questions: what is the inference hardware, what is the sparsity ratio, and is the pretraining compute available to fill that envelope?
The analysis carries honest uncertainty. Chip roadmaps slip. Sparsity assumptions may prove too aggressive or too conservative. The 200-trillion-token data ceiling is an estimate, not a measurement, and synthetic data and multimodal sources could shift it. Nesov labels these as estimates throughout, and the appropriate read is a range of plausible outcomes, not a point forecast.
Teams evaluating multi-year model access contracts or building infrastructure assumptions around frontier model availability in 2028 and beyond should track which hardware generations labs are actually deploying, not just what capabilities they announce.
Source: “Model Size Scaling in 2023-2031” by Vladimir Nesov, published on LessWrong, June 22, 2026.