DFlash delivers 4.3x throughput gains on Qwen 3.5 serving

Z Lab, Modal, and SGLang shipped a new speculative decoding method that beats both baseline inference and native multi-token prediction across every benchmark they ran.

Alessandro Benigni

PUBLISHED JUN 17, 2026

At concurrency 1 on the HumanEval coding benchmark, the new DFlash draft model for Qwen 3.5 397B-A17B produces more than 4.3 times the token throughput of baseline inference and 1.5 times the throughput of native MTP speculation. Those numbers, published June 15 on the LMSYS blog, come from a joint release by Z Lab, Modal, and the SGLang team, with benchmarks run on 8xB200 hardware via Modal.

Speculative decoding uses a small, fast draft model to propose several tokens at once, which the larger target model then verifies in a single parallel pass. No output quality is lost; the speedup comes entirely from batching what was previously a sequential, one-token-at-a-time process. Multi-token prediction (MTP), the native variant baked into models like DeepSeek-V4 and Gemma 4, works the same way, but it still generates its draft tokens autoregressively, one after another, inside the draft model itself.
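
The draft-then-verify loop can be sketched in a few lines of Python. This is a minimal toy under greedy decoding, not the SGLang or DFlash implementation: `speculative_step`, `generate`, `toy_target`, `toy_draft`, and the integer token rule are all hypothetical stand-ins, and the verification loop stands in for what a real target model does in one parallel forward pass.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # maps a context to the next greedy token


def speculative_step(target_next: NextToken, draft_next: NextToken,
                     context: List[int], k: int) -> List[int]:
    """Run one draft-then-verify cycle and return the tokens it commits."""
    # 1) Draft: the cheap model proposes k tokens, one after another.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft_next(context + drafted))

    # 2) Verify: keep drafted tokens while they match the target's own choice.
    #    A real target model scores all k positions in one parallel pass;
    #    the loop here is only for readability.
    committed: List[int] = []
    for token in drafted:
        expected = target_next(context + committed)
        if token != expected:
            committed.append(expected)  # the target's correction ends the cycle
            return committed
        committed.append(token)

    # 3) Every draft was accepted: the verify pass also yields one extra token.
    committed.append(target_next(context + committed))
    return committed


def generate(target_next: NextToken, draft_next: NextToken,
             prompt: List[int], n_tokens: int, k: int = 4) -> Tuple[List[int], int]:
    """Generate n_tokens and report how many verify cycles were needed."""
    out = list(prompt)
    cycles = 0
    while len(out) - len(prompt) < n_tokens:
        out.extend(speculative_step(target_next, draft_next, out, k))
        cycles += 1
    return out[:len(prompt) + n_tokens], cycles


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after the token 2.
def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 2 else (ctx[-1] + 1) % 10


tokens, cycles = generate(toy_target, toy_draft, [0], n_tokens=12)
print(tokens, cycles)
```

Because a drafted token is kept only when it matches the target's own greedy choice, the output is identical to plain target decoding; the gain is that fewer target passes are needed. Sampling-based decoding uses a rejection-sampling acceptance rule in place of the exact match shown here.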

DFlash’s core departure is replacing that sequential draft generation with block diffusion. The draft model generates an entire block of tokens in one forward pass rather than token by token, which suits the parallel execution model of GPUs and TPUs far better. A 5-layer DFlash drafter generating 16 tokens carries lower drafting latency than a single-layer EAGLE-3 drafter producing 4.

The second piece of the architecture is KV injection. Previous methods like EAGLE pass target model representations only to the input layer of the draft model; that signal degrades in deeper draft networks. DFlash instead injects those hidden representations directly into the KV cache of every draft layer, keeping the drafter tightly conditioned on the target model’s context throughout. The combination lifts both sides of the speculative decoding speedup equation: diffusion drafting cuts the draft cost, while KV injection raises acceptance length.
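
One common simplified way to write that speedup equation is accepted tokens per cycle divided by the cost of one draft-plus-verify cycle, with costs measured in units of one ordinary target decoding step. This form is an assumption of the sketch below, not a formula quoted from the LMSYS post, and the function name and example numbers are purely illustrative.

```python
def speculative_speedup(acceptance_length: float,
                        draft_cost: float,
                        verify_cost: float = 1.0) -> float:
    """Estimate speedup over plain decoding under a simplified cost model.

    acceptance_length: average tokens committed per draft-plus-verify cycle.
    draft_cost, verify_cost: time per cycle, in units of one target decode step.
    """
    if acceptance_length <= 0 or draft_cost < 0 or verify_cost <= 0:
        raise ValueError("inputs must be positive (draft_cost may be zero)")
    return acceptance_length / (draft_cost + verify_cost)


# Illustrative numbers only, not measurements from the post.
baseline = speculative_speedup(acceptance_length=3.0, draft_cost=0.5)
cheaper_draft = speculative_speedup(acceptance_length=3.0, draft_cost=0.25)
longer_accept = speculative_speedup(acceptance_length=4.5, draft_cost=0.5)

print(f"baseline drafter:      {baseline:.2f}x")
print(f"cheaper drafting:      {cheaper_draft:.2f}x")
print(f"longer acceptance:     {longer_accept:.2f}x")
```

In this model the numerator and denominator are the two independent levers: lowering `draft_cost` shrinks the denominator, while raising `acceptance_length` grows the numerator.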

The ablation data from Z Lab’s R&D phase makes both effects legible. On HumanEval with a Qwen 3-4B target, a 5-layer EAGLE-3 drafter achieves an acceptance length of 4.3 tokens with a 2.2x end-to-end speedup. A 5-layer DFlash drafter hits an acceptance length of 4.0 but a 3.2x speedup, because parallel drafting is faster even at a marginally lower acceptance length. When KV injection is added on top, acceptance lengths on GSM8K reach 4.8 tokens, though end-to-end speedup is pulled back by the autoregressive drafting cost in that configuration.
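
As a rough sanity check on the HumanEval figures, one can back out the implied cost of a draft-plus-verify cycle. The sketch assumes a simplified model in which speedup equals acceptance length divided by cycle cost (cycle cost in units of one ordinary target decoding step); that model is an assumption made here, not something the post states, and `implied_cycle_cost` is a hypothetical helper.

```python
def implied_cycle_cost(acceptance_length: float, speedup: float) -> float:
    """Cycle cost implied by speedup = acceptance_length / cycle_cost."""
    if acceptance_length <= 0 or speedup <= 0:
        raise ValueError("acceptance_length and speedup must be positive")
    return acceptance_length / speedup


# Figures quoted above for the Qwen 3-4B target on HumanEval.
eagle3_cost = implied_cycle_cost(acceptance_length=4.3, speedup=2.2)
dflash_cost = implied_cycle_cost(acceptance_length=4.0, speedup=3.2)

print(f"EAGLE-3 implied cycle cost: {eagle3_cost:.2f} target steps")
print(f"DFlash implied cycle cost:  {dflash_cost:.2f} target steps")
```

Under that assumption, a DFlash cycle costs about 1.25 target steps against roughly 1.95 for EAGLE-3, which is consistent with cheaper drafting, rather than higher acceptance, driving the gap.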

SGLang’s Spec V2 engine compounds these gains through a second mechanism called the overlap scheduler. The V1 engine stalled GPU execution while the host CPU handled batch cleanup (stop token detection, metadata updates) and KV cache allocation for the next batch. The V2 engine overlaps those host tasks with live GPU work. On a single B200 running Qwen 3-8B at concurrency 32, V2 improved throughput by more than 33 percent, from roughly 11.4 thousand tokens per second to roughly 15.3 thousand tokens per second, according to the LMSYS post.
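
The effect of overlapping host work with GPU execution can be illustrated with a toy timing model. The millisecond values and function names below are hypothetical, and the model assumes host work is fully hidden whenever it is shorter than the GPU step, which a real scheduler may not achieve. The final lines only recompute the relative gain from the two throughput figures quoted above.

```python
def serial_step_ms(gpu_ms: float, host_ms: float) -> float:
    """Step time when the GPU idles while the host does bookkeeping."""
    return gpu_ms + host_ms


def overlapped_step_ms(gpu_ms: float, host_ms: float) -> float:
    """Step time when host bookkeeping runs concurrently with GPU work."""
    return max(gpu_ms, host_ms)


def relative_gain(before: float, after: float) -> float:
    """Fractional improvement of `after` over `before`."""
    return after / before - 1.0


# Hypothetical per-step timings, chosen only to show the mechanism.
gpu_ms, host_ms = 10.0, 2.0
serial = serial_step_ms(gpu_ms, host_ms)
overlapped = overlapped_step_ms(gpu_ms, host_ms)
toy_gain = relative_gain(1.0 / serial, 1.0 / overlapped)
print(f"toy model: {serial:.0f} ms -> {overlapped:.0f} ms per step, "
      f"{toy_gain:.0%} more throughput")

# Relative gain from the reported figures (thousand tokens per second).
reported_gain = relative_gain(before=11.4, after=15.3)
print(f"reported: {reported_gain:.1%} more throughput")
```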

A few caveats matter here. All benchmark figures in the post are self-reported by the teams that built the system. The hardware setup (8xB200 on Modal) is high-end and the gains will not translate linearly to cheaper or older GPU configurations. The headline 4.3x figure applies at concurrency 1, a low-load scenario where speculative decoding gains are largest; throughput advantages compress at higher concurrencies. The teams note that the DFlash model for Qwen 3.5 397B-A17B beats native MTP across concurrencies from 1 to 32, but they do not disclose by how much at the high end.

For serving teams running large MoE models in production, the practical signal is that the open-source inference stack is now competitive on throughput with approaches that require proprietary MTP modules. The DFlash draft models are available on Hugging Face across the Z Lab, Modal, and LMSYS organizations, and the SGLang integration ships as the default Spec V2 engine. Teams currently sizing GPU budgets for 400B-class model serving should run DFlash against their workload before finalizing capacity estimates.

Published by the Z Lab, Modal, and SGLang teams on the LMSYS blog on June 15, 2026, at lmsys.org/blog/2026-06-15-next-generation-speculative-decoding-dflash-v2/.
