On June 10, Google released DiffusionGemma, a 26-billion-parameter Mixture of Experts open model that applies diffusion-style generation to text. The headline number is a 4x throughput gain over comparable autoregressive models on dedicated GPUs. Google is explicit that this comes with a quality tradeoff: DiffusionGemma is positioned for speed-critical workloads, not as a replacement for standard Gemma 4.

The architectural break from standard LLMs is the unit of generation. Most language models produce one token at a time, left to right, which leaves much of a local GPU's arithmetic capacity idle while each step waits on memory reads. DiffusionGemma starts with a canvas of random placeholder tokens and makes multiple forward passes, locking in tokens and using them as context to refine the rest. Each pass processes 256 tokens simultaneously. On a single NVIDIA H100, that produces over 1,000 tokens per second; on a GeForce RTX 5090, over 700.
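The refinement loop can be sketched in a few lines. This is a toy illustration, not Google's implementation: `toy_model` is a hypothetical stand-in that proposes random tokens with random confidences and ignores context, a `<mask>` sentinel stands in for the random placeholder tokens, and the eight-pass schedule and the confidence-based locking rule are assumptions chosen for clarity. Only the 256-token block size comes from the announcement.

```python
import random

MASK = "<mask>"  # sentinel for a position that has not been locked in yet


def toy_model(canvas, vocab, rng):
    """Hypothetical stand-in for one forward pass.

    Proposes a (token, confidence) pair for every still-masked position.
    A real model would condition on the locked-in tokens; this one does not.
    """
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(canvas)
        if tok == MASK
    }


def diffusion_decode(block_size, num_passes, vocab, seed=0):
    """Fill a block over several passes, locking the most confident proposals first."""
    rng = random.Random(seed)
    canvas = [MASK] * block_size
    for step in range(num_passes):
        proposals = toy_model(canvas, vocab, rng)
        if not proposals:
            break
        # Lock an equal share of the remaining positions on each pass
        # (ceiling division so the final pass always finishes the block).
        passes_left = num_passes - step
        quota = -(-len(proposals) // passes_left)
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            canvas[i] = proposals[i][0]
    return canvas


block = diffusion_decode(block_size=256, num_passes=8, vocab=["a", "b", "c"])
```

The point of the sketch is the shape of the loop: a 256-token block is finished in a handful of passes, where a token-by-token decoder would need 256 sequential steps.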

Bidirectional attention is what makes this possible. Autoregressive models use causal attention: each token can only see what came before it. DiffusionGemma, because it refines an entire block at once, lets every token attend to every other token in the block. Google notes this creates specific advantages for tasks where future context matters, including code infilling, inline editing, and structured outputs like markdown formatting. Unsloth demonstrated the point by fine-tuning DiffusionGemma to solve Sudoku, a task in which each cell depends on cells that come later in the sequence and which causal models handle poorly.
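The difference between the two attention patterns is easiest to see as a visibility mask. The helper names below are illustrative only and do not come from any DiffusionGemma code; the sketch shows which positions a given token may attend to under each scheme.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Every position may attend to every other position in the block."""
    return [[True] * n for _ in range(n)]


def visible_context(mask, i):
    """Indices that position i is allowed to see."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Position 1 in a 5-token block, e.g. a gap being filled in the middle of a line.
causal_view = visible_context(causal_mask(5), 1)                # [0, 1]
bidirectional_view = visible_context(bidirectional_mask(5), 1)  # [0, 1, 2, 3, 4]
```

Under the causal mask the gap at position 1 sees only the prefix, while under the bidirectional mask it also sees the suffix, which is the property infilling and inline editing rely on.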

The latency economics are worth placing in context. Over the past two weeks, the inference speed problem has attracted three distinct engineering approaches. Xiaomi reported a model running at 1,000 tokens per second using FP4 quantization. KV-cache compression work has targeted the memory bandwidth bottleneck. DiffusionGemma attacks the problem differently: by parallelizing generation itself, it shifts compute pressure from sequential memory reads to the arithmetic throughput that modern GPU architectures are built for. These are not competing solutions; they address different parts of the inference stack.

The quality-for-speed tradeoff is real and Google does not obscure it. The release announcement states directly that DiffusionGemma’s output quality is lower than standard Gemma 4 and recommends deploying Gemma 4 for applications requiring maximum quality. The 4x number applies to dedicated, compute-bound accelerators. Apple Silicon Macs, which are memory-bandwidth-bound during inference, are unlikely to see the same acceleration, per a footnote in the release.

DiffusionGemma activates only 3.8 billion of its 26 billion parameters at inference, and the quantized model fits within 18GB of VRAM. That puts it within reach of a high-end consumer GPU such as an RTX 4090 or 5090. Google worked with NVIDIA to optimize across the Hopper and Blackwell generations using NVFP4 kernels. Inference is supported through vLLM, MLX, and Hugging Face Transformers; llama.cpp support is listed as arriving soon. Weights are available on Hugging Face under an Apache 2.0 license.
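A back-of-envelope check makes the 18GB figure plausible. The sketch below rests on assumptions the announcement does not spell out: 4-bit weights (consistent with NVFP4, but ignoring per-block scale factors), every expert kept resident on the GPU, and no allowance for the KV cache or activations.

```python
def weight_footprint_gb(total_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # total parameter count, from the announcement
ACTIVE_PARAMS = 3.8e9  # parameters active at inference, from the announcement
VRAM_BUDGET_GB = 18    # quantized footprint quoted in the announcement

# Assumption: 4 bits per weight, with quantization metadata ignored.
weights_gb = weight_footprint_gb(TOTAL_PARAMS, bits_per_weight=4)
headroom_gb = VRAM_BUDGET_GB - weights_gb
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

print(f"weights: {weights_gb:.1f} GB, headroom: {headroom_gb:.1f} GB")
print(f"active share of parameters: {active_fraction:.1%}")
```

Under these assumptions the weights alone take about 13GB, leaving roughly 5GB of the 18GB budget for everything else. The small active share, about 15 percent of the parameters, reduces the arithmetic done at inference; it does not shrink what has to sit in memory.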

One constraint worth noting: the speed advantage is strongest at low-to-medium batch sizes on a single accelerator. In high-QPS cloud serving, where autoregressive models can batch thousands of concurrent requests to saturate compute, DiffusionGemma’s parallel decoding offers diminishing returns and may carry higher serving costs. The throughput win is most relevant to local inference and low-concurrency deployments, which is also where agent workflows tend to chain many short generations.
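A toy roofline model shows why the advantage narrows as batch size grows. Everything here is a simplification for illustration: each step is assumed to cost the larger of a fixed weight-read time and a per-token compute time, the model is treated as dense, the eight passes per block are hypothetical, and the two timing constants are made up so that the batch-1 ratio lands on the 4x headline. It is not a prediction of real serving behavior.

```python
def step_time(tokens_in_flight, t_weights, t_per_token):
    """Toy roofline: a step costs the larger of reading the weights once
    (memory-bound) and the arithmetic for every token processed (compute-bound)."""
    return max(t_weights, tokens_in_flight * t_per_token)


def autoregressive_tps(batch, t_weights, t_per_token):
    # Each step emits one new token per request.
    return batch / step_time(batch, t_weights, t_per_token)


def diffusion_tps(batch, t_weights, t_per_token, block=256, passes=8):
    # Each pass processes a full block per request; a block completes after
    # `passes` passes, so a pass yields block / passes finished tokens per request.
    finished_per_pass = batch * block / passes
    return finished_per_pass / step_time(batch * block, t_weights, t_per_token)


# Made-up constants in arbitrary time units, chosen only for illustration.
T_WEIGHTS = 1.0
T_PER_TOKEN = 1 / 32


def speedup(batch):
    """Diffusion throughput divided by autoregressive throughput at a given batch size."""
    return (diffusion_tps(batch, T_WEIGHTS, T_PER_TOKEN)
            / autoregressive_tps(batch, T_WEIGHTS, T_PER_TOKEN))


ratios = {b: speedup(b) for b in (1, 4, 64)}
```

With these constants the ratio falls from 4 at batch 1 to 1 at batch 4 and 0.125 at batch 64. Once batching alone saturates compute, the extra refinement passes become pure overhead, which is the mechanism behind the diminishing returns described above.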

For teams building agentic systems that generate many short outputs sequentially, DiffusionGemma is worth benchmarking now. The Apache 2.0 license removes distribution friction, the model runs on high-end consumer GPUs, and the 4x figure is concrete enough to evaluate against current inference costs before committing to a serving architecture.

Google published this announcement on blog.google on June 10, 2026.