Apple Recycles the Tokens Diffusion Language Models Throw Away

Residual Context Diffusion reuses discarded token data to lift diffusion model accuracy by 5 to 10 points, nearly doubling scores on the hardest math benchmark.

Alessandro Benigni

PUBLISHED JUL 3, 2026

Researchers at Apple and the University of California, Berkeley built a module that recovers computation that diffusion language models currently throw away. The module, called Residual Context Diffusion (RCD), lifted accuracy by 5 to 10 points across benchmarks and nearly doubled scores on the hardest reasoning tasks tested, according to a paper posted on Apple’s machine learning research site.

Diffusion language models denoise an entire block of text at once rather than generating one token at a time the way GPT-style autoregressive models do. Leading block-wise diffusion language models (dLLMs) rely on a remasking step: at each pass, the model keeps only the tokens it is most confident about and reworks everything else in a later pass. Discarding those low-confidence predictions wastes computation, because the tokens the model sets aside still carry information the next pass could use.
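
The remasking step described above can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function name `remask_step`, the `keep` parameter, and the toy three-word vocabulary are all assumptions made for the example, and real dLLMs use their own confidence schedules.

```python
def remask_step(probs, tokens, keep):
    """One confidence-based remasking pass over a block (illustrative only).

    probs:  per-position probability distributions over the vocabulary
    tokens: current block; None marks a still-masked position
    keep:   how many masked positions to commit on this pass (hypothetical knob)
    """
    # Collect the top prediction and its confidence for every masked position.
    candidates = []
    for pos, tok in enumerate(tokens):
        if tok is None:
            best = max(range(len(probs[pos])), key=lambda v: probs[pos][v])
            candidates.append((probs[pos][best], pos, best))

    # Commit only the most confident predictions; the rest stay masked.
    candidates.sort(reverse=True)
    updated = list(tokens)
    for _, pos, best in candidates[:keep]:
        updated[pos] = best

    # Positions whose predictions are thrown away until a later pass.
    discarded = sorted(pos for _, pos, _ in candidates[keep:])
    return updated, discarded


probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.10, 0.80, 0.10],
    [0.34, 0.33, 0.33],
]
block, discarded = remask_step(probs, [None, None, None, None], keep=2)
print(block)      # [0, None, 1, None]
print(discarded)  # [1, 3]
```

Positions 1 and 3 return to the mask state even though the model produced a full distribution for each of them. That discarded distribution is the information RCD is designed to reuse.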

RCD intercepts that information before it disappears. It converts low-confidence token representations into what the researchers term contextual residuals, then reinjects them so the model carries that memory into its next denoising round. Instead of starting each pass cold, the model builds on a compressed trace of what it already worked out.
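
The article does not spell out how a contextual residual is computed, so the following is only one plausible reading, sketched under loud assumptions: the residual is taken to be a probability-weighted average of token embeddings, blended into the mask embedding with a hypothetical mixing weight `alpha`. The function name, the blending rule, and the toy embeddings are all invented for illustration and may differ from the paper's formulation.

```python
def contextual_residual(probs, embeddings, mask_embedding, alpha):
    """Blend a soft token embedding into the mask embedding for one position.

    probs:          the model's distribution over the vocabulary at this position
    embeddings:     one embedding vector per vocabulary entry
    mask_embedding: the embedding normally fed in for a masked position
    alpha:          hypothetical mixing weight in [0, 1]; 0 reproduces plain remasking
    """
    dim = len(mask_embedding)
    # Probability-weighted average of token embeddings (assumed residual form).
    soft = [
        sum(p * emb[d] for p, emb in zip(probs, embeddings))
        for d in range(dim)
    ]
    # Interpolate between the plain mask embedding and the soft embedding.
    return [(1 - alpha) * m + alpha * s for m, s in zip(mask_embedding, soft)]


embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-word vocabulary
mask_embedding = [0.0, 0.0]
low_confidence = [0.5, 0.25, 0.25]                 # a prediction that would be discarded

next_input = contextual_residual(low_confidence, embeddings, mask_embedding, alpha=0.5)
print(next_input)  # [0.375, 0.25]
```

With `alpha=0.0` the function returns the bare mask embedding, which corresponds to the baseline behavior of starting the next pass cold. Any positive `alpha` lets the next pass see a trace of the earlier prediction.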

Training a model to use those residuals normally means backpropagating through several denoising steps at once, a process that burns through memory on large models. The researchers sidestep that cost with a training approach split into two decoupled stages, avoiding full backpropagation across the chain. A standard dLLM can be converted to the RCD architecture with roughly 1 billion tokens of additional training, a small fraction of the compute spent pretraining a frontier model from scratch.

The team tested RCD on two model families: SDAR, tuned for long chain-of-thought reasoning, and LLaDA, tuned for shorter instruction-following tasks. Both gained 5 to 10 accuracy points across the benchmark suite while adding little extra compute. On AIME, a competition-math benchmark that stresses multi-step reasoning, RCD nearly doubled baseline accuracy. At equivalent accuracy levels, it needed up to 4 to 5 times fewer denoising steps, a proxy for inference cost.

The paper compares RCD only against the dLLM baselines it modifies, not against production autoregressive systems such as GPT-5 or Claude. Diffusion language models remain a minority bet in deployed AI. Apple, Google with its experimental Gemini Diffusion, and a handful of open research groups are the most visible backers, wagering that generating tokens in parallel can eventually undercut the per-token cost of autoregressive inference. RCD does not settle that wager. It closes some of the accuracy gap that made dLLMs a research curiosity rather than a production choice, without adding meaningful compute cost.

Teams evaluating diffusion models for latency-sensitive products should treat RCD as a reason to re-run that evaluation, not a reason to switch: the reported gains come from the Apple and Berkeley team’s own evaluations, and independent replication against matched autoregressive baselines has not yet appeared.

Reported by Apple in research published on its machine learning research site.
