For several years, data curation was treated as one of the highest-leverage decisions in pretraining. The FineWeb dataset from Hugging Face, the Dolma corpus from AI2, and the filtering recipes behind models like LLaMA 3 and Mistral all rested on the same premise: more selective data pipelines produce better models. A preprint submitted to arXiv by Christopher Mohri on May 19, 2026 challenges that premise in a specific but important regime.

The paper’s central finding is clean: when compute is abundant and raw data is scarce, the optimal data filter is no data filter at all. Scaling studies conducted by Mohri target what the paper calls the high-compute, data-scarce regime, where a lab has more FLOP budget than it has high-quality tokens to spend those FLOPs on. In that regime, large-parameter models not only tolerate low-quality text and distractor data but appear to benefit from it.

This result inverts the logic behind projects like DCLM (DataComp for Language Models), where a large multi-institution collaboration demonstrated that aggressive quality filtering improved benchmark performance for a given token and compute budget. It also pushes back against the FineWeb scaling analysis from Hugging Face, which showed curated CommonCrawl subsets outperforming unfiltered crawls. Both of those results, the arXiv paper implies, were measured in a different regime: one with constrained compute and plentiful filtered tokens.

The intuition behind the reversal echoes Rich Sutton’s 2019 “Bitter Lesson” argument, which holds that general methods that scale with computation eventually outperform hand-crafted inductive biases. Data filtering is a form of human-applied inductive bias. If a model is large enough and trained long enough, it may learn to discount low-quality tokens automatically, rendering manual pre-filtering unnecessary or even harmful, because filtering artificially narrows the training distribution.

The regime-dependence of this finding deserves emphasis. Mohri’s experiments target a specific corner of the training space: large parameter count, high FLOPs, limited high-quality token supply. This describes frontier labs running multi-trillion-token training runs where the curated web has already been exhausted. It does not obviously describe a mid-size lab training a 7B or 30B model on a standard academic budget, where filtered datasets still appear to win.
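
One rough way to check whether a planned run sits in that corner is to estimate how many passes it would make over the filtered token pool. The sketch below is illustrative only: it uses the common rule of thumb that training compute is about 6 FLOPs per parameter per token, and the function names and the `EPOCH_THRESHOLD` cutoff are assumptions made for this example, not definitions taken from the preprint.

```python
# Illustrative sketch only. The 6 * N * D FLOP estimate is a widely used
# rule of thumb for dense transformers; EPOCH_THRESHOLD is an arbitrary
# assumption for this example, not a value from the paper.
EPOCH_THRESHOLD = 4.0  # hypothetical cutoff for calling a run "data-scarce"


def tokens_affordable(compute_flops: float, n_params: float) -> float:
    """Tokens a model with n_params can train on, using C ~= 6 * N * D."""
    if compute_flops <= 0 or n_params <= 0:
        raise ValueError("compute_flops and n_params must be positive")
    return compute_flops / (6.0 * n_params)


def epochs_over_pool(compute_flops: float, n_params: float,
                     pool_tokens: float) -> float:
    """Approximate number of passes over a pool of unique tokens."""
    if pool_tokens <= 0:
        raise ValueError("pool_tokens must be positive")
    return tokens_affordable(compute_flops, n_params) / pool_tokens


def is_data_scarce(compute_flops: float, n_params: float,
                   pool_tokens: float,
                   threshold: float = EPOCH_THRESHOLD) -> bool:
    """True when the budget implies more passes than the chosen threshold."""
    return epochs_over_pool(compute_flops, n_params, pool_tokens) > threshold
```

With made-up inputs, a 400B-parameter run with 7.2e25 FLOPs and a 5-trillion-token filtered pool works out to roughly six passes over the pool, which this sketch would flag as data-scarce. A 7B-parameter run with 5.88e21 FLOPs and a 1-trillion-token pool would see only about 14% of the pool and would not be flagged.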

The study has not yet been independently replicated, and the arXiv preprint has not been peer-reviewed. The paper’s own framing presents the “apparently common belief” in quality filtering as the claim it sets out to test. The experimental conditions, including which model sizes were evaluated and which benchmarks were used to measure benefit from poor data, are details that matter for interpreting the result and are not visible in the abstract alone.

For operators managing data pipeline budgets, the practical signal is narrower than the headline implies. If your training run is large enough to already be scraping the long tail of the web, or if you are considering synthetic data augmentation to fill compute headroom, the cost of running an aggressive quality filter may not be justified. If you are training models under 30B parameters with a well-curated dataset and a standard compute envelope, this paper does not change your calculus yet.

The result does raise a pressure point for teams that have built expensive data-cleaning infrastructure: the value of that infrastructure may decay as compute scales up. Labs approaching the frontier should run their own ablations against unfiltered baselines before assuming curation still earns its keep.
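
A minimal sketch of how such an ablation could be laid out is shown below. It is not the preprint's protocol. It assumes a paired design in which both arms train on the same number of tokens, so the filtered arm repeats its smaller pool more often. `AblationRun`, `plan_ablation`, and `keep_fraction` (the share of raw tokens that survive the filter) are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationRun:
    n_params: float      # model size in parameters
    seed: int            # random seed, so arms can be compared pairwise
    arm: str             # "unfiltered" or "filtered"
    train_tokens: float  # token budget, identical across arms
    epochs: float        # passes over this arm's token pool


def plan_ablation(model_sizes, seeds, train_tokens,
                  raw_pool_tokens, keep_fraction):
    """Enumerate matched filtered/unfiltered runs at a fixed token budget."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if train_tokens <= 0 or raw_pool_tokens <= 0:
        raise ValueError("token counts must be positive")
    # Hypothetical assumption: the filter keeps a fixed share of raw tokens.
    pools = {
        "unfiltered": raw_pool_tokens,
        "filtered": raw_pool_tokens * keep_fraction,
    }
    runs = []
    for n_params, seed, arm in product(model_sizes, seeds, pools):
        runs.append(AblationRun(
            n_params=n_params,
            seed=seed,
            arm=arm,
            train_tokens=train_tokens,
            epochs=train_tokens / pools[arm],
        ))
    return runs
```

Holding the token budget fixed across arms keeps compute matched for a given model size, so any difference in evaluation results can be attributed to the data rather than to extra training. Running several seeds per arm helps separate a real effect from run-to-run noise.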

Source: arXiv preprint arXiv:2605.19407, submitted by Christopher Mohri on 2026-05-19.