Every token you save is money you do not spend. Alex Nichol, an independent researcher writing at blog.aqnichol.com, published a technical post on June 10 showing that provably optimal tokenization is achievable in practice despite being theoretically intractable, at least for specific problem settings.
Tokenization is the preprocessing step that converts raw text into the integer sequences a model actually trains and runs on. The dominant method today is byte-pair encoding (BPE), a greedy compression algorithm that has been in use for decades. BPE is fast and good enough, but it is not optimal. It makes locally sensible choices that do not guarantee the shortest possible token sequence for a given vocabulary size.
Nichol’s approach borrows from integer linear programming (ILP). Building on a formulation introduced by Tempus et al. in a recent paper, he represents a dataset’s entire tokenization as a set of integer variables and frames vocabulary selection as minimizing the total token count subject to a fixed vocabulary budget. The twist he adds is cutting-plane iteration: starting from a relaxed continuous LP, he iteratively adds valid constraints, called cuts, until the LP solution becomes fully integral and therefore optimal. This is the same technique used by solvers for the Traveling Salesman Problem, one of the most studied combinatorial optimization problems in computer science.
The practical results are modest but real. Nichol demonstrated a provably optimal tokenizer for a vocabulary size of 512 on the text of Pride and Prejudice. The algorithm converged in roughly a dozen iterations over about a day of compute on a Mac Studio. Current BPE implementations are already within about 1% of optimal in many settings, so the gap Nichol is closing is narrow. Still, the direction matters more than the margin.
Nichol is explicit about the caveats. The optimality guarantee holds only within the pretokenization constraint (splitting text into words before running the ILP), which his current formulation inherits from Tempus et al. A tokenizer optimal on training data may also not generalize as well to held-out text. And for larger vocabulary sizes, his cut families ran out before converging, meaning more mathematical work is needed before the method scales to production vocabularies of 50,000 or 100,000 tokens. He notes that a slightly larger vocabulary can compensate for a suboptimal tokenizer anyway, which reduces the practical urgency.
None of that undercuts the significance of treating tokenization as an optimization target rather than a fixed infrastructure choice. Most teams set their tokenizer once, often by copying GPT’s BPE configuration, and never revisit it. The choice sits upstream of every downstream cost: training FLOPs, inference latency, context utilization, and per-API-call billing. A tokenizer that represents the same corpus in 5% fewer tokens produces a 5% cost reduction across every operation that touches those tokens, without any model change.
This piece lands in the same week that the broader efficiency thread in AI infrastructure produced multiple advances attacking inference cost from other angles, including KV-cache compression and reduced-precision activations. The tokenizer is the one lever upstream of all of them. Getting it right before training starts is cheaper than compensating with architecture changes afterward.
Nichol used OpenAI’s Codex throughout the project to propose and implement cut families, noting that Codex autonomously discovered “cycle constraints,” the most effective cutting-plane family he found. The implementation is available on GitHub under the repository tokenizer_lp.
Teams building or fine-tuning models on domain-specific corpora, where the general-purpose BPE vocabulary fits poorly, have the most to gain from revisiting tokenization now. Nichol’s work gives a principled framework for measuring how far your current tokenizer is from optimal, even if closing that gap entirely remains compute-intensive.
Source: Alex Nichol, blog.aqnichol.com, published June 10, 2026.