Mercury 2 hits 1,000 tokens per second. Here is what that buys you.

Inception Labs ships a diffusion-based reasoning model roughly eleven times faster than Claude Haiku 4.5 Reasoning, but the speed comes with a precision ceiling and a closed API.

Alessandro Benigni

PUBLISHED JUN 22, 2026

Inception Labs released Mercury 2 on June 18, claiming roughly 1,000 tokens per second. That is approximately eleven times the 89 tokens per second of Anthropic’s Claude Haiku 4.5 Reasoning and fourteen times the 71 tokens per second of OpenAI’s GPT-5 Mini. That single number is the product’s entire thesis: speed is the scarce resource in multi-agent systems, and Mercury 2 is built to supply it.
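The multiples follow directly from the quoted rates. A quick check, using only the throughput figures cited above (the function and constant names are ours, chosen for illustration):

```python
# Throughput figures as quoted in the article, in tokens per second.
MERCURY_2_TPS = 1000
HAIKU_45_REASONING_TPS = 89
GPT_5_MINI_TPS = 71


def speed_ratio(fast_tps: float, slow_tps: float) -> float:
    """Return how many times faster the first rate is than the second."""
    return fast_tps / slow_tps


print(f"{speed_ratio(MERCURY_2_TPS, HAIKU_45_REASONING_TPS):.1f}x")  # 11.2x
print(f"{speed_ratio(MERCURY_2_TPS, GPT_5_MINI_TPS):.1f}x")          # 14.1x
```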

The architecture is the story. Mercury 2 uses diffusion, the same underlying technique that turns noise into an image in Stable Diffusion, applied to text generation. A standard autoregressive model writes tokens sequentially, each conditioned on every token before it. Mercury 2 instead fills an output block with random placeholder tokens, then denoises the entire block across several parallel passes until the text resolves. The model can finish an entire response in roughly the wall-clock time a conventional model needs to reach mid-sentence.
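A toy sketch makes the difference in step counts concrete. This is not Inception's algorithm, which is not public: there is no model here, the "prediction" is simply copied from a known target, and the fixed reveal schedule, the `<mask>` placeholder, and all function names are assumptions made for illustration. The only point is that sequential decoding needs one step per token, while block denoising needs a small, fixed number of passes regardless of length.

```python
import math
import random

MASK = "<mask>"  # hypothetical placeholder token


def autoregressive_decode(target):
    """Emit one token per step, left to right. Returns (tokens, steps)."""
    out, steps = [], 0
    for tok in target:  # each step waits on everything emitted before it
        out.append(tok)
        steps += 1
    return out, steps


def block_denoise(target, num_passes, seed=0):
    """Start from an all-placeholder block and resolve it in parallel passes.

    Each pass commits an even share of the still-unresolved positions.
    Copying from `target` is a stand-in for a real model's prediction.
    Returns (tokens, passes).
    """
    rng = random.Random(seed)
    block = [MASK] * len(target)
    unresolved = list(range(len(target)))
    rng.shuffle(unresolved)  # positions resolve in no particular order
    passes = 0
    for p in range(num_passes):
        if not unresolved:
            break
        share = math.ceil(len(unresolved) / (num_passes - p))
        commit, unresolved = unresolved[:share], unresolved[share:]
        for i in commit:  # conceptually, one parallel update over the block
            block[i] = target[i]
        passes += 1
    return block, passes


sentence = "the model resolves every position of the block in a few passes".split()
ar_tokens, ar_steps = autoregressive_decode(sentence)
dn_tokens, dn_passes = block_denoise(sentence, num_passes=4)
print(ar_steps, dn_passes)  # 12 sequential steps versus 4 parallel passes
```

In a real diffusion model each pass is a full forward pass over the block, so a pass costs more than a single autoregressive step; the speedup comes from needing far fewer of them.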

The quality payoff from diffusion shows up in specific places. On AIME 2026, a mathematics benchmark built from American Invitational Mathematics Examination problems, Mercury 2 scored 90%. Google’s DiffusionGemma, released around the same time and built on the same architectural family, scored 69.1% on the same set. Standard Gemma 4, the non-diffusion version, scored 88.3%. Mercury 2 outperforms both. On GPQA, a PhD-level science benchmark, the two diffusion models are close: Mercury 2 at 77% against DiffusionGemma at 73.2%. Both trail the strongest frontier autoregressive models on that metric, which marks the architecture’s current ceiling.

An outside data point sharpens the picture. Augment Code, an AI coding-agent company, replaced Claude Opus 4.7 on its context-compaction subagent with Mercury 2 and reported an 82% reduction in latency and a 90% reduction in cost with no degradation in output quality, per a joint case study with Inception. Subagent context compaction is precisely the kind of task where diffusion’s advantage compounds: high volume, well-defined scope, latency-sensitive, not requiring frontier reasoning depth.

That framing matters when weighing Mercury 2 as a product decision. Three workflow categories change meaningfully at 1,000 tokens per second. First, agent loops where a subagent runs dozens of classification or summarization calls per user session: the cost and latency savings convert a previously expensive operation into a cheap one. Second, real-time autocomplete and code suggestion, where perceived responsiveness is the product. Third, high-volume classification pipelines, where throughput directly determines unit economics. Mercury 2 does not displace a frontier reasoning model on tasks requiring multi-step deduction over ambiguous inputs; it replaces expensive autoregressive inference on tasks where the answer is predictable enough that speed wins.
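A back-of-the-envelope estimate shows why the first category changes. The workload below (40 sequential subagent calls of 200 output tokens each) is a made-up example, not a figure from Inception or Augment Code, and the calculation ignores network round trips, queueing, and time to first token unless an overhead is passed in.

```python
def session_seconds(calls, tokens_per_call, tokens_per_second, overhead_per_call=0.0):
    """Estimate total generation time for sequential subagent calls.

    overhead_per_call is a hypothetical fixed cost per call in seconds
    (network, queueing); it defaults to zero.
    """
    per_call = overhead_per_call + tokens_per_call / tokens_per_second
    return calls * per_call


CALLS = 40            # assumed calls per user session
TOKENS_PER_CALL = 200  # assumed output tokens per call

at_1000 = session_seconds(CALLS, TOKENS_PER_CALL, 1000)
at_89 = session_seconds(CALLS, TOKENS_PER_CALL, 89)
print(f"{at_1000:.1f}s at 1,000 tok/s, {at_89:.1f}s at 89 tok/s")
```

Under these assumptions the session's generation time drops from about a minute and a half to about eight seconds. Any fixed per-call overhead narrows the gap, so measured gains on a real pipeline will be smaller than the raw throughput ratio suggests.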

Inception Labs was founded by Stefano Ermon, a Stanford professor whose research on score-based diffusion techniques underpins much of today’s image generation field. The company raised a $50 million round with backing from Nvidia’s venture arm and individual investors Andrew Ng and Andrej Karpathy.

Two structural limits deserve attention before any integration decision. The first is access: Mercury 2 is closed-weight and API-only, so teams that need local inference, on-premise deployment, or fine-tuning access have no path today. DiffusionGemma, in contrast, is open-weight on Hugging Face, which matters for teams with data-residency requirements or a preference to own the inference stack. The second limit is verification: the 1,000 tokens per second figure comes from Inception’s own benchmarks and the Augment Code case study, which is a production signal but not an independent benchmark. Third-party throughput validation at scale, across varied hardware configurations, has not yet appeared in public.

Google’s own documentation concedes the quality trade-off directly: its developer guide recommends standard Gemma 4 over DiffusionGemma for applications requiring maximum quality. That concession frames the category honestly. Diffusion LLMs are a real architectural advance for speed-sensitive workloads; they are not yet a drop-in replacement for the highest-stakes reasoning steps in a pipeline.

Teams building multi-agent systems should evaluate Mercury 2 on their highest-volume, lowest-ambiguity subagent calls before the next infrastructure contract review.

Reporting by Decrypt (Jose Antonio Lanz), published June 21, 2026.
