Google published Quantization-Aware Training (QAT) checkpoints for Gemma 4 on June 5, compressing the E2B edge model to under 1GB of active memory. The announcement, posted to Google’s developer blog, marks a meaningful quality threshold: on-device inference is no longer a capability compromise for developers who want production-grade output without a round-trip to a cloud API.
QAT differs from the standard approach in one consequential way. Post-training quantization shrinks a finished model after the fact, and quality degrades because the weights were never trained to tolerate that compression. QAT bakes the quantization simulation into the training loop itself, so the model learns to be small from the start. Google says its QAT results beat standard post-training quantization baselines on quality benchmarks, though the company conducted those evaluations internally and has not published third-party verification.
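To make the distinction concrete, here is a minimal sketch of the "quantization simulation" idea on a toy linear model, in plain Python. It is not Google's training recipe, which the announcement does not detail. The function names (`fake_quantize`, `qat_step`), the symmetric per-tensor 4-bit scheme, and the straight-through gradient shortcut are all assumptions chosen for illustration.

```python
def fake_quantize(weights, num_bits=4):
    """Round weights onto a signed integer grid, then map them back to floats.

    This quantize-then-dequantize round trip is the "simulation": the values
    stay floats, but they only take levels the low-bit format could store.
    """
    qmax = 2 ** (num_bits - 1) - 1                     # 7 for signed 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0     # one scale per tensor
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]


def qat_step(weights, x, target, lr=0.01, num_bits=4):
    """One SGD step on squared error for the toy model y = w . x.

    Returns the updated full-precision weights and the loss measured
    with the quantized weights before the update.
    """
    w_q = fake_quantize(weights, num_bits)             # forward pass sees quantized weights
    pred = sum(wq * xi for wq, xi in zip(w_q, x))
    err = pred - target
    # Straight-through estimator (an assumption of this sketch): rounding has
    # no useful gradient, so the gradient computed at the quantized weights
    # is applied directly to the full-precision copy.
    new_weights = [w - lr * 2 * err * xi for w, xi in zip(weights, x)]
    return new_weights, err ** 2


# Toy usage: the loss is always measured through the quantizer, so the
# weights are pushed toward values that still work after rounding.
weights = [0.5, -0.25, 0.1]
for _ in range(200):
    weights, loss = qat_step(weights, x=[1.0, 2.0, 3.0], target=2.0)
print("quantized weights:", fake_quantize(weights), "loss:", loss)
```

Post-training quantization, by contrast, would amount to calling `fake_quantize` once on weights that were trained without ever seeing the rounding.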
The memory reduction is roughly 4x compared to standard FP16 checkpoints, with quality degradation measured in the low single digits on Google’s own benchmarks. For the E2B and E4B edge variants, Google built a mobile-specific quantization schema rather than adapting the generic Q4_0 format. Four design choices drive the savings: pre-calculated static activations that reduce on-chip math overhead, channel-wise quantization aligned to mobile accelerator architecture, 2-bit compression targeted at token-generation layers while keeping reasoning layers at higher precision, and compression of the embedding and KV cache to shrink active memory during long conversations.
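The arithmetic behind a roughly 4x reduction can be sketched in a few lines. The parameter count and the split between 2-bit and higher-precision layers below are hypothetical placeholders, not figures from Google's release; the point is only that a mixed-precision model averaging about 4 bits per weight is a quarter the size of its 16-bit counterpart.

```python
def footprint_bytes(param_count, bits_per_weight):
    """Weight storage in bytes, ignoring scales, metadata, and runtime buffers."""
    return param_count * bits_per_weight / 8


def average_bits(layer_groups):
    """Average bits per weight for (fraction_of_params, bits) pairs."""
    total = sum(fraction for fraction, _ in layer_groups)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in layer_groups)


# Hypothetical model: 1 billion weights, half stored at 2 bits and half at 6 bits.
PARAMS = 1_000_000_000
mixed_bits = average_bits([(0.5, 2), (0.5, 6)])   # averages to 4 bits per weight

fp16_size = footprint_bytes(PARAMS, 16)
mixed_size = footprint_bytes(PARAMS, mixed_bits)
print(f"FP16: {fp16_size / 1e9:.2f} GB, mixed: {mixed_size / 1e9:.2f} GB, "
      f"ratio: {fp16_size / mixed_size:.1f}x")
```

Real checkpoints also carry quantization scales and, during inference, a KV cache, which is why the announcement calls out cache and embedding compression separately.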
The practical implication for builders is direct. A text-only Gemma 4 E2B model, with audio and vision encoders stripped out, runs in under 1GB. That fits on hardware that could not run any frontier-adjacent model six months ago. Checkpoints are available now on Hugging Face in GGUF format for llama.cpp, as compressed tensors for vLLM, and through Google’s LiteRT-LM runtime for Android deployment. MLX support covers Apple Silicon.
The timing sits inside a larger convergence. Apple announced Multi-AI Extensions at WWDC this week, enabling third-party models to slot into system-level inference pipelines alongside on-device Siri processing. Perplexity shipped a hybrid local-cloud orchestrator that routes queries between on-device and server inference based on complexity. Three separate bets, from three separate companies, are converging on the same architectural position: cloud-only AI is one option, not the only viable shape for production deployment.
Google’s QAT release is the infrastructure layer that makes the other two bets more credible. Apple’s Multi-AI Extensions need capable small models to route to. Perplexity’s hybrid orchestrator needs on-device quality close enough to cloud quality that the routing decision is worth making. At sub-1GB with single-digit benchmark degradation, Gemma 4 QAT is the first openly published checkpoint that clears both bars simultaneously.
The skepticism worth holding: all quality measurements in this announcement come from Google. The release does not include independent evaluations or side-by-side comparisons against quantized versions of competing open models at similar size points. Developers planning to ship production features on the basis of these quality claims should run their own task-specific evals before committing.
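A task-specific eval does not need to be elaborate to be useful. The sketch below compares two models on the same prompt set using exact-match accuracy. Everything in it is an assumption for illustration: the function names, the exact-match metric, and the stub "models", which are plain Python callables standing in for whatever runtime a team uses to call a full-precision baseline and the QAT checkpoint.

```python
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, str]  # (prompt, expected answer)


def exact_match_accuracy(model: Callable[[str], str], cases: List[Case]) -> float:
    """Fraction of cases where the model's output matches the expected answer."""
    if not cases:
        raise ValueError("need at least one eval case")
    hits = sum(
        1 for prompt, expected in cases
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return hits / len(cases)


def compare(baseline: Callable[[str], str],
            candidate: Callable[[str], str],
            cases: List[Case]) -> Dict[str, float]:
    """Score both models on identical cases and report the difference."""
    base = exact_match_accuracy(baseline, cases)
    cand = exact_match_accuracy(candidate, cases)
    return {"baseline": base, "candidate": cand, "delta": cand - base}


# Stub models for demonstration only; replace with real inference calls.
CASES = [("2+2", "4"), ("capital of France", "Paris"),
         ("3*3", "9"), ("color of a clear sky", "blue")]
reference_answers = dict(CASES)
degraded_answers = {**reference_answers, "3*3": "6"}   # one simulated regression

report = compare(reference_answers.get, degraded_answers.get, CASES)
print(report)
```

In practice the cases should come from the product's own traffic or test set, and exact match should give way to whatever metric the task actually calls for.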
Teams building on-device features for Android or planning to integrate with Apple’s Multi-AI Extensions should benchmark Gemma 4 QAT against their specific workloads this week. The quality-to-size ratio at under 1GB has not been publicly matched by any openly licensed checkpoint, and toolchain support across llama.cpp, Ollama, LM Studio, and vLLM is available from day one.
Google Developer Blog (blog.google), published June 5, 2026.