Xiaomi hits 1,000 tokens/sec on a trillion-param model using commodity GPUs

MiMo-V2.5-Pro-UltraSpeed reaches speeds no frontier lab has matched at this scale, using only software on standard 8-GPU nodes.

Alessandro Benigni

PUBLISHED JUN 10, 2026

Xiaomi and inference startup TileRT shipped MiMo-V2.5-Pro-UltraSpeed on June 9, breaking 1,000 tokens per second on a 1-trillion-parameter model running on a standard 8-GPU commodity node. According to Decrypt, the speed peaks near 1,200 tokens per second in demos. No custom silicon required.

The context for that number matters. Artificial Analysis data puts GPT-5.5 at 68 tokens per second and Claude Opus 4.6 at roughly 71. Gemini Flash reaches 192. Cerebras, which built a wafer-scale chip specifically to attack GPU bandwidth constraints, hit 969 tokens per second on Meta’s Llama 3.1 405B, a model with less than half the parameter count of MiMo-V2.5-Pro. Groq’s custom Language Processing Unit delivers 300 to 750 tokens per second depending on the model. Neither Cerebras nor Groq runs on hardware available to rent from a standard cloud provider tonight.

Two techniques produce the speed gain. FP4 quantization compresses only the mixture-of-experts layers, which account for most of the trillion parameters, down to 4-bit precision while keeping routing and attention layers at full precision. The result is a smaller memory footprint and lower bandwidth pressure with, according to Xiaomi, near-zero quality degradation. The benchmark parity claim is with standard MiMo-V2.5-Pro, not with Claude Opus or GPT-5.5; the announcement does not include independent third-party benchmark validation.
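A rough back-of-the-envelope sketch shows why quantizing only the expert layers matters for memory. The 1-trillion-parameter count and the 8-GPU node come from the announcement; the share of parameters held in expert layers (0.95) and the 16-bit baseline are assumptions made here for illustration, since the article says only "most" and "full precision." The estimate covers weights only and ignores quantization scale factors, the KV cache, and activations.

```python
def weight_bytes(total_params, expert_fraction, expert_bits, other_bits):
    """Approximate weight storage in bytes for a model whose expert
    layers and remaining layers are stored at different bit widths."""
    expert_params = total_params * expert_fraction
    other_params = total_params - expert_params
    return (expert_params * expert_bits + other_params * other_bits) / 8


TOTAL_PARAMS = 1e12      # from the article: 1 trillion parameters
GPUS_PER_NODE = 8        # from the article: standard 8-GPU node
EXPERT_FRACTION = 0.95   # ASSUMPTION: the article only says "most"
BASELINE_BITS = 16       # ASSUMPTION: 16-bit weights before quantization
FP4_BITS = 4             # 4-bit precision for the expert layers

# Every weight at the assumed 16-bit baseline.
baseline_gb = weight_bytes(TOTAL_PARAMS, 1.0, BASELINE_BITS, BASELINE_BITS) / 1e9

# Experts at 4-bit, routing and attention left at the baseline width.
mixed_gb = weight_bytes(TOTAL_PARAMS, EXPERT_FRACTION, FP4_BITS, BASELINE_BITS) / 1e9

print(f"all weights at 16-bit : {baseline_gb:,.0f} GB")
print(f"experts at 4-bit      : {mixed_gb:,.0f} GB")
print(f"per GPU after FP4     : {mixed_gb / GPUS_PER_NODE:,.1f} GB")
```

Under these assumed numbers the weights shrink from about 2,000 GB to roughly 575 GB, or around 72 GB per GPU across the node. A different expert share or baseline precision would move the result, so the figures are indicative only.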

The second technique is DFlash speculative decoding. Standard speculative decoding uses a small draft model to guess several tokens one at a time, then verifies them with the large model in parallel. DFlash skips token-by-token drafting, proposing a full block by filling masked positions in a single forward pass. In coding tasks, the large model accepts 6.3 out of 8 proposed tokens per verification round. A purpose-built inference engine called TileRT keeps the full compute pipeline resident inside the GPU with no per-operator launch gaps. Xiaomi describes the combined approach as “extreme model-system codesign,” and the Decrypt report notes that neither technique alone reaches the 1,000 tokens-per-second threshold.
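The verification step that all speculative decoding schemes share can be sketched in a few lines. This is the generic greedy variant, not Xiaomi's or TileRT's implementation: the function name, the token IDs, and the convention that the large model returns one more token than the block contains are assumptions made for illustration.

```python
from typing import List, Sequence


def verify_block(draft: Sequence[int], target_choices: Sequence[int]) -> List[int]:
    """Greedy speculative verification (illustrative sketch).

    draft          -- k proposed tokens for the next k positions.
    target_choices -- k + 1 tokens from the large model, computed in one
                      parallel pass; target_choices[i] is the token the
                      large model itself would emit after draft[:i].
    Returns the tokens committed this round: the longest draft prefix
    the large model agrees with, plus one token from the large model
    (a correction at the first mismatch, or a bonus token if the whole
    block was accepted).
    """
    if len(target_choices) != len(draft) + 1:
        raise ValueError("target_choices must have len(draft) + 1 entries")

    committed: List[int] = []
    for i, token in enumerate(draft):
        if token != target_choices[i]:
            break                      # first disagreement ends acceptance
        committed.append(token)
    committed.append(target_choices[len(committed)])
    return committed


# Hypothetical token IDs: the third proposal is rejected and replaced.
partial = verify_block([5, 9, 2, 7], [5, 9, 4, 1, 3])

# Every proposal accepted, so the large model adds one bonus token.
full = verify_block([5, 9, 2, 7], [5, 9, 2, 7, 3])

# Reported figure for coding tasks: 6.3 of 8 proposals accepted per round.
acceptance_rate = 6.3 / 8
```

Because the output always matches what the large model would have produced on its own under greedy decoding, the speed gain depends entirely on how many proposals survive each round. The reported 6.3 of 8 works out to an acceptance rate of about 79 percent.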

The pricing structure favors latency-sensitive workloads. UltraSpeed costs three times the standard MiMo-V2.5-Pro rate for roughly ten times the generation speed. At baseline, MiMo-V2.5-Pro runs at approximately $0.43 input and $0.87 output per million tokens, making Claude Opus (at $5 input and $25 output) roughly twelve times more expensive on input and twenty-nine times on output for equivalent volume. The UltraSpeed premium raises the per-token cost but keeps it well below frontier alternatives, at speeds that change what can be built in production.
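The ratios implied by the quoted prices can be checked with simple arithmetic. The per-million-token prices and the three-times multiplier come from the article; applying the multiplier equally to input and output tokens is an assumption, since the announcement as reported does not break the premium down.

```python
# USD per million tokens, as quoted in the article.
MIMO_INPUT, MIMO_OUTPUT = 0.43, 0.87
OPUS_INPUT, OPUS_OUTPUT = 5.00, 25.00

# Article: UltraSpeed costs three times the standard rate.
# ASSUMPTION: the multiplier applies equally to input and output.
ULTRASPEED_MULTIPLIER = 3

ultra_input = MIMO_INPUT * ULTRASPEED_MULTIPLIER
ultra_output = MIMO_OUTPUT * ULTRASPEED_MULTIPLIER

# How many times more Claude Opus costs than each MiMo tier.
opus_vs_standard = (OPUS_INPUT / MIMO_INPUT, OPUS_OUTPUT / MIMO_OUTPUT)
opus_vs_ultra = (OPUS_INPUT / ultra_input, OPUS_OUTPUT / ultra_output)

print(f"UltraSpeed price : ${ultra_input:.2f} in / ${ultra_output:.2f} out")
print(f"Opus vs standard : {opus_vs_standard[0]:.1f}x in, {opus_vs_standard[1]:.1f}x out")
print(f"Opus vs UltraSpeed: {opus_vs_ultra[0]:.1f}x in, {opus_vs_ultra[1]:.1f}x out")
```

On these figures, Claude Opus still costs several times more than UltraSpeed even after the premium, which is the basis for the claim that the faster tier remains well below frontier pricing.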

Fraud detection pipelines, real-time voice agents, browser-driven automation loops, and interactive coding tools all have latency constraints that 60 to 100 tokens per second cannot satisfy. At 1,000 tokens per second, parallel reasoning paths become practical at production scale. The economics shift from “can we afford to run this” toward “how many parallel agents do we want to run.”
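To make the latency argument concrete, the sketch below converts generation speed into wall-clock time for a single response and for a sequential agent loop. The speeds are the ones quoted in this article; the 500-token response length and the 10-step loop are hypothetical workload parameters chosen for illustration. The estimate counts generation time only and ignores time to first token, network round trips, and tool calls.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time spent generating output_tokens at a steady decode speed."""
    return output_tokens / tokens_per_second


# Tokens per second, as quoted in the article.
SPEEDS = {
    "GPT-5.5": 68,
    "Claude Opus 4.6": 71,
    "Gemini Flash": 192,
    "MiMo-V2.5-Pro-UltraSpeed": 1000,
}

RESPONSE_TOKENS = 500   # ASSUMPTION: hypothetical response length
AGENT_STEPS = 10        # ASSUMPTION: hypothetical sequential agent loop

for name, speed in SPEEDS.items():
    per_step = generation_seconds(RESPONSE_TOKENS, speed)
    print(f"{name:26s} {per_step:5.2f} s per step, "
          f"{per_step * AGENT_STEPS:6.1f} s for {AGENT_STEPS} steps")
```

For this assumed workload, a loop that spends over a minute generating text at around 70 tokens per second finishes its generation in about five seconds at 1,000, which is the difference between a batch job and an interactive tool.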

The industrial structure behind UltraSpeed is also notable. Most high-throughput open-weights work from China has come from DeepSeek, Alibaba’s Qwen team, or MiniMax. Xiaomi, best known for consumer hardware, reaching this level of performance through a partnership with TileRT represents a different path, one that relies on inference-engine co-design rather than bespoke silicon or a research-first lab structure.

A limited API trial runs from June 9 to June 23, available on an application basis with priority given to enterprise and professional developers. The FP4-DFlash checkpoint is already open-sourced on Hugging Face for community testing.

Any team building latency-sensitive agent workflows should run their workloads against MiMo-V2.5-Pro-UltraSpeed before the June 23 trial closes; the cost-to-throughput ratio at this parameter scale has no current equivalent on rentable hardware.

Decrypt (decrypt.co), 2026-06-09.
