Google retrofits multi-token prediction onto frozen Gemini Nano

A zero-copy MTP head attached to existing Gemini Nano v3 weights delivers 50 percent or more inference speedup on Pixel 9 without retraining the base model.

Alessandro Benigni

PUBLISHED JUN 30, 2026

Google has retrofitted Multi-Token Prediction (MTP) onto deployed, weight-frozen Gemini Nano v3 models running on Pixel 9 and 10 phones, pushing on-device inference speeds up by 50 percent or more on certain tasks without touching the base model’s parameters. The company detailed the architecture on the Google Research blog on June 28, 2026.

The result matters beyond Pixel. The dominant assumption in on-device AI has been that meaningful inference gains require either a smaller model, a retraining pass, or purpose-built silicon. Google’s approach challenges that assumption: existing deployed models can receive a bolt-on efficiency layer without any degradation in output quality or safety alignment, because wrong draft tokens are discarded before they affect the final response.

How standard speculative decoding fails at the edge

Speculative decoding works by having a compact draft model propose several tokens at once, then letting the large model verify them in a single forward pass. When the drafts are correct, you skip multiple expensive generation steps. The problem on mobile is twofold. First, a standalone 128M-parameter drafter competes for limited RAM. Second, a standalone drafter has no visibility into the main model’s internal state, so it predicts from text history alone, missing the semantic context the larger model already computed.
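The draft-and-verify loop described above can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding in general, not Google's implementation: the function names, the toy models, and the `draft_len` parameter are all assumptions made for the example, and the verification step is simulated position by position rather than batched into one forward pass.

```python
from typing import Callable, List, Tuple

Token = int
NextTokenFn = Callable[[List[Token]], Token]  # greedy next-token function


def greedy_decode(target_next: NextTokenFn, prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one target-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_decode(
    target_next: NextTokenFn,
    draft_next: NextTokenFn,
    prompt: List[Token],
    max_new_tokens: int,
    draft_len: int = 4,
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (new tokens, number of verification passes)."""
    tokens = list(prompt)
    passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. The drafter proposes draft_len tokens autoregressively.
        draft: List[Token] = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # 2. The target checks every draft position. Real systems do this in a
        #    single batched forward pass; here it is simulated one position at a time.
        passes += 1
        accepted: List[Token] = []
        for i in range(draft_len):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # the target's own token replaces the bad draft
                break
        else:
            # Every draft matched, so the same pass also yields one bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[len(prompt):len(prompt) + max_new_tokens], passes


# Hypothetical toy models: the drafter agrees with the target except when the
# context length is a multiple of three.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: List[Token]) -> Token:
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 7


spec_tokens, spec_passes = speculative_decode(toy_target, toy_draft, [1, 2], 12)
baseline_tokens = greedy_decode(toy_target, [1, 2], 12)
```

Because every rejected draft is replaced by the target's own token, `spec_tokens` matches `baseline_tokens` exactly, while `spec_passes` is smaller than the twelve passes the baseline needs.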

Google’s answer is to discard the standalone drafter entirely. Instead, a lightweight transformer stack, the MTP head, attaches to the final layers of the frozen Gemini Nano v3 backbone. The head takes the main model’s final hidden-state activations and uses them to predict a sequence of future tokens. Because it sees the backbone’s rich representations, its predictions are more accurate than those from a blind external drafter of comparable size.

The zero-copy constraint

RAM on a Pixel phone is not negotiable. Even sharing embedding weights between a drafter and the main model does not solve the problem, because a drafter that processes context independently must build and maintain its own key-value (KV) cache, which Google describes as a “double tax” on memory.
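To see why a second KV cache is costly, it helps to write out the standard size formula: one key tensor and one value tensor per layer, each holding `num_kv_heads * head_dim` values per token. The sketch below uses made-up model dimensions purely for illustration. They are not Gemini Nano's, which the article does not state, and the function name is hypothetical.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int = 2,  # assumes 16-bit cache entries
) -> int:
    """Size of a transformer KV cache: keys plus values for every layer and token."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value


# Illustrative dimensions only (assumptions, not published figures).
main_cache = kv_cache_bytes(num_layers=24, num_kv_heads=8, head_dim=128, seq_len=2048)
drafter_cache = kv_cache_bytes(num_layers=6, num_kv_heads=4, head_dim=64, seq_len=2048)

print(f"main model cache:         {main_cache / 2**20:.0f} MiB")
print(f"standalone drafter cache: {drafter_cache / 2**20:.0f} MiB (paid on top)")
```

With these example numbers the drafter's cache is small next to the main model's, but it is memory a shared-cache design never has to allocate, and it grows linearly with context length.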

The MTP head sidesteps this by cross-attending directly to the main model’s existing KV cache rather than building its own. The design eliminates drafter prefill latency entirely, since the prompt context is already computed. Google reports memory savings of 130MB per instance compared to a standalone drafter, achieved by removing separate embedding lookup tables, prefill attention variants, and task-specific tuning parameters.
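The idea of reading the backbone's cache instead of building a new one can be illustrated with a single-query attention step. This is a simplified sketch under stated assumptions: the `KVCache` class and `draft_head_context` function are hypothetical names, the cache holds one attention head with plain Python lists, and the real MTP head is a transformer stack whose details the article does not give.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class KVCache:
    """Keys and values the main model already computed for the prompt."""
    keys: List[List[float]] = field(default_factory=list)
    values: List[List[float]] = field(default_factory=list)


def draft_head_context(query: Sequence[float], cache: KVCache) -> List[float]:
    """Scaled dot-product attention of one draft-head query over the shared cache.

    The cache is only read: nothing is copied, appended, or recomputed.
    """
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [
        sum(w * value[j] for w, value in zip(weights, cache.values)) / total
        for j in range(dim)
    ]


# Toy cache with two prompt positions, filled in by the (hypothetical) backbone.
shared_cache = KVCache(keys=[[1.0, 0.0], [0.0, 1.0]], values=[[10.0, 0.0], [0.0, 10.0]])
context = draft_head_context([5.0, 0.0], shared_cache)
```

The query aligns with the first cached key, so `context` is weighted toward the first cached value, and `shared_cache` is the same object before and after the call.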

What the numbers show

On Pixel 9 devices, MTP delivers speedups of 50 percent or more compared to standalone drafters of similar parameter counts. For tasks with predictable output structure, such as smart replies, the acceptance rate for draft tokens improves by up to 55 percent. In production workloads including AI Notification Summaries and Proofread, the system correctly predicts an average of nearly two additional tokens per inference pass. Fewer verification cycles also mean less time waking the phone's heavier processors, which Google says reduces energy consumption and extends battery life.
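A back-of-the-envelope model shows how accepted draft tokens translate into throughput. It is an idealized estimate against plain one-token-per-pass decoding, not the comparison against standalone drafters that Google reports, and both the function name and the `overhead_per_pass` parameter (the extra drafting and verification cost, as a fraction of one ordinary forward pass) are assumptions made for the example.

```python
def estimated_speedup(extra_tokens_per_pass: float, overhead_per_pass: float = 0.0) -> float:
    """Throughput relative to decoding one token per target-model pass.

    Each pass emits 1 guaranteed token plus the accepted draft tokens, and
    costs 1 ordinary pass plus the drafting/verification overhead.
    """
    if extra_tokens_per_pass < 0 or overhead_per_pass < 0:
        raise ValueError("inputs must be non-negative")
    return (1.0 + extra_tokens_per_pass) / (1.0 + overhead_per_pass)


no_drafting = estimated_speedup(0.0)                          # baseline: 1.0
two_extra_free = estimated_speedup(2.0)                       # zero-overhead upper bound
two_extra_costly = estimated_speedup(2.0, overhead_per_pass=0.5)
```

Under this model, two extra accepted tokens per pass would triple throughput if drafting were free and double it if drafting added half a pass of overhead. Real gains depend on measured overhead and on how acceptance varies by task.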

Why frozen-model retrofitting changes the deployment calculus

The standard on-device AI playbook requires mobile teams to choose between capability and speed at model-selection time, then live with that trade-off for the product cycle. Retraining or fine-tuning a new draft model for each downstream task adds engineering overhead and memory pressure. Google’s approach decouples those concerns: the base model handles quality and alignment, and the MTP head handles throughput. Because the output is bit-for-bit identical to what the base model would produce without the head, deploying efficiency updates carries no backward-compatibility risk.

This also has a direct privacy implication. On-device AI keeps sensitive data (notification text, draft messages, typed queries) on the hardware. Faster on-device inference makes the privacy-preserving path more competitive with server-side alternatives on the dimension where it has historically lagged: latency. If MTP heads can be trained quickly on top of frozen foundation models, the cost of shipping an on-device privacy-first feature drops substantially.

Google says it is extending this work to future Pixel devices and investigating two further techniques: parallel decoding, and verification leniency, which would relax the strict exact-match requirement for draft tokens in contexts where approximation is acceptable.

For teams shipping on-device features on Android, the practical implication is near-term: a frozen-backbone MTP retrofit is a lower-cost path to faster inference than retraining or model substitution, and Google's reported production data on Pixel 9 gives those teams a deployed reference point to benchmark against.

Published on the Google Research blog on June 28, 2026.
