Moondream published an engineering breakdown on June 4 explaining why GPUs running AI models spend part of every generation loop sitting idle, and how its Photon inference engine removes that waste. The company says the fix, called pipelined decoding, delivers up to 35 percent higher decode throughput without new hardware. That is a claim about scheduling, not about a bigger or smarter model.
The underlying problem is structural. Language and vision-language models generate one token at a time, and each token depends on the one before it: you cannot compute the third token before the second exists. Producing a token requires a round trip between CPU and GPU. The GPU runs the billions of arithmetic operations that produce the token. The CPU picks which requests run next, prepares metadata, and records the sampled result. Moondream calls the resulting stall a GPU bubble: the accelerator finishes its math and then waits for the CPU’s fixed-cost housekeeping before it can start the next token.
Pipelined decoding overlaps those two jobs instead of running them in sequence. The GPU launches the next token’s computation while the CPU is still finalizing bookkeeping for the previous one, because the newly sampled token never has to leave GPU memory to feed the next step. Moondream’s post frames the insight plainly: the change that removes the bubble is not waiting for the sampled token to be copied back to the CPU.
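The overlap can be illustrated with a toy timeline model. The sketch below is not Photon's scheduler: the function names, the limit of two steps in flight, and the millisecond figures are all assumptions chosen for illustration.

```python
def serial_time(steps: int, gpu_ms: float, cpu_ms: float) -> float:
    """Every token pays for GPU compute and then CPU bookkeeping, back to back."""
    return steps * (gpu_ms + cpu_ms)


def pipelined_time(steps: int, gpu_ms: float, cpu_ms: float) -> float:
    """GPU step k+1 starts when step k's compute ends; the CPU finalizes step k
    in parallel. Assumption: two buffer sets, so step k must also wait until
    the CPU has finalized step k-2 and freed its buffers."""
    gpu_end = 0.0        # when the GPU finishes the most recent step
    cpu_end_prev = 0.0   # when the CPU finished bookkeeping for step k-1
    cpu_end_prev2 = 0.0  # when the CPU finished bookkeeping for step k-2
    for _ in range(steps):
        gpu_start = max(gpu_end, cpu_end_prev2)
        gpu_end = gpu_start + gpu_ms
        # Bookkeeping for this step needs its result and a free CPU.
        cpu_end = max(gpu_end, cpu_end_prev) + cpu_ms
        cpu_end_prev2, cpu_end_prev = cpu_end_prev, cpu_end
    return cpu_end_prev


if __name__ == "__main__":
    # Hypothetical costs: 10 ms of GPU work and 1 ms of CPU work per token.
    s = serial_time(100, 10.0, 1.0)
    p = pipelined_time(100, 10.0, 1.0)
    print(f"serial {s} ms, pipelined {p} ms, gain {s / p - 1:.1%}")
```

In this model the CPU cost is paid once at the end of the run instead of once per token, as long as bookkeeping takes less time than a GPU step.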
Three mechanisms make that safe, according to the post. Ping-pong slots give each overlapping step its own buffer set, so two steps in flight cannot overwrite each other’s results. A “forward now, sample later” ordering lets the next token’s forward pass start immediately and defers sampling, which depends on the prior step’s committed state, until that state is ready. And a scheme Moondream calls zombie handling lets a finished request’s stray in-flight step run to completion and be discarded, rather than requiring cancellation logic mid-flight.
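A toy decode loop makes two of those mechanisms concrete. This is a minimal sketch under stated assumptions, not Moondream's code: `toy_forward`, `EOS`, and `decode` are hypothetical names, the "model" just counts down to an end token, and sampling is folded into the forward pass, so the "forward now, sample later" ordering is not modeled.

```python
EOS = 0  # hypothetical end-of-sequence token id


def toy_forward(prev_token: int) -> int:
    """Stand-in for a forward pass plus sampling: counts down toward EOS."""
    return max(prev_token - 1, EOS)


def decode(first_token: int) -> tuple[list[int], int]:
    """Run a two-deep pipeline for one request.

    Returns (committed tokens, number of zombie steps discarded).
    """
    slots = [0, 0]          # ping-pong buffers: one per in-flight step
    committed: list[int] = []
    finished = False
    zombies = 0
    step = 0

    slots[0] = toy_forward(first_token)  # launch step 0
    while True:
        cur, nxt = step % 2, (step + 1) % 2
        if finished:
            # The step sitting in slots[cur] was launched before the CPU knew
            # the request had ended. Let it finish and drop its result.
            zombies += 1
            break
        # Launch the next step from the previous result without waiting for
        # the commit below. It writes to the other slot, so the result the
        # CPU is about to read cannot be overwritten.
        slots[nxt] = toy_forward(slots[cur])
        # CPU-side commit of the current step.
        committed.append(slots[cur])
        finished = slots[cur] == EOS
        step += 1
    return committed, zombies
```

Calling `decode(3)` commits the tokens `[2, 1, 0]` and reports one zombie step: the pass that was launched just before the end token was committed.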
Moondream backs the claim with a cost model and matching benchmarks rather than a single marketing number. On an Nvidia RTX 3090 at 32 concurrent streams, the company measured an 11.6 percent speedup, close to its predicted 11.1 percent. On a B200, the same configuration produced a 35.4 percent measured gain against a 39.1 percent prediction. The pattern the company highlights: the win grows with GPU speed, because bookkeeping cost stays roughly fixed while the GPU’s per-token compute time keeps shrinking on faster silicon.
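The scaling argument can be sketched with a deliberately simple cost model. The article does not spell out Moondream's own model, so the formula and every number below are assumptions: a serial step costs GPU time plus CPU time, and an overlapped step costs whichever of the two is longer.

```python
def predicted_gain(gpu_ms: float, cpu_ms: float) -> float:
    """Fractional throughput gain from hiding CPU work behind GPU work.

    Assumed model: serial step = gpu + cpu, overlapped step = max(gpu, cpu).
    """
    serial = gpu_ms + cpu_ms
    overlapped = max(gpu_ms, cpu_ms)
    return serial / overlapped - 1.0


if __name__ == "__main__":
    cpu_ms = 1.0  # hypothetical fixed bookkeeping cost per step
    # Hypothetical per-token GPU times, from slower to faster hardware.
    for gpu_ms in (10.0, 5.0, 2.5):
        print(f"gpu {gpu_ms} ms -> gain {predicted_gain(gpu_ms, cpu_ms):.1%}")
```

Under this simplified model the gain equals the ratio of CPU time to GPU time whenever bookkeeping is the shorter of the two. Read that way, the reported predictions of 11.1 and 39.1 percent would correspond to bookkeeping that costs roughly a ninth of a GPU step on the RTX 3090 and about two fifths of one on the B200.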
The company also names the technique’s cost. Launching a step before the prior one commits means a request that finishes mid-flight can leave one wasted forward pass in the pipeline, a “zombie” step. Moondream estimates this tax at about 1 percent of throughput on a single stream and says it nearly disappears once requests are batched, since the wasted row rides along in a step that is already paying to move the full weight set through memory.
The company frames pipelined decoding as one of many stacked optimizations, not the whole story behind Photon’s roughly 33-millisecond inference time on a B200. It also disclosed the technique publicly rather than keeping it proprietary, which invites scrutiny that a bare benchmark table would not. The post does not include independent, third-party benchmark results: every number comes from Moondream’s own measurements against its own predictions.
For teams building latency-sensitive inference stacks, the more durable point is not the specific 35 percent figure but the mechanism: as GPUs and models both get faster, CPU-side scheduling overhead becomes a larger share of total decode time unless it is explicitly hidden. Any team running high-throughput serving on newer accelerators should check whether its own scheduler still blocks the GPU on commit-and-launch bookkeeping, because that gap only grows as hardware improves.
Reported by Moondream on June 4, 2026.