FlashMemory cuts DeepSeek-V4 KV-cache to 10-15% with no accuracy loss

A new open-source retriever predicts which cache chunks matter and evicts the rest, slashing long-context GPU memory 6-10x.

Alessandro Benigni

PUBLISHED JUN 11, 2026

A lightweight retriever released on GitHub can run DeepSeek-V4 long-context inference while keeping only 10 to 15 percent of the KV cache on GPU. The rest is offloaded to CPU or disk. Benchmarks published with the release show the approach matching or beating full attention on RULER, LongMemEval, and LongBench V2 across contexts from 64K to 500K tokens.
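
As a quick check on the headline numbers, the reduction factor is the reciprocal of the resident fraction. The sketch below assumes the figure covers only the KV cache (not model weights or activations) and that all chunks are the same size; the function name is illustrative and not taken from the repository.

```python
def reduction_factor(resident_fraction: float) -> float:
    """How many times smaller the GPU-resident KV cache is than the full cache."""
    if not 0.0 < resident_fraction <= 1.0:
        raise ValueError("resident_fraction must be in (0, 1]")
    return 1.0 / resident_fraction

# The article's 10-15% range, expressed as fractions.
low = reduction_factor(0.15)   # least aggressive setting
high = reduction_factor(0.10)  # most aggressive setting
print(f"{low:.1f}x to {high:.1f}x")
```

Keeping 15 percent resident gives a bit under 7x and keeping 10 percent gives 10x, which is consistent with the roughly 6-10x range quoted for the release.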

The technique, called FlashMemory, targets DeepSeek-V4’s compressed sparse attention (CSA) mechanism. Every 64 decode steps, a small predictor scores all cached key-value chunks and keeps only the top scorers resident on GPU. For million-token contexts, the repository documentation estimates this is the difference between needing eight GPUs and needing one.
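
The control flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released code: the predictor is replaced by a random scorer, the chunk count is held fixed rather than growing during decoding, and every name here (`score_chunks`, `select_resident`, `decode`) is hypothetical. Only the 64-step refresh interval and the 10 to 15 percent keep ratio come from the article.

```python
import random

REFRESH_INTERVAL = 64  # decode steps between re-scoring passes (from the article)
KEEP_FRACTION = 0.125  # assumed midpoint of the 10-15% range

def score_chunks(num_chunks, rng):
    """Stand-in for the retriever: one random relevance score per cached chunk."""
    return [rng.random() for _ in range(num_chunks)]

def select_resident(scores, keep_fraction):
    """Return the indices of the top-scoring chunks to keep on GPU."""
    k = max(1, round(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])

def decode(num_steps, num_chunks, seed=0):
    """Toy decode loop that refreshes the resident set every REFRESH_INTERVAL steps."""
    rng = random.Random(seed)
    resident = set()
    refreshes = 0
    for step in range(num_steps):
        if step % REFRESH_INTERVAL == 0:
            scores = score_chunks(num_chunks, rng)
            resident = select_resident(scores, KEEP_FRACTION)
            refreshes += 1
        # A real decode step would attend only over the chunks in `resident`;
        # everything else would stay offloaded to CPU or disk.
    return resident, refreshes

resident, refreshes = decode(num_steps=256, num_chunks=1000)
print(len(resident), "of 1000 chunks resident after", refreshes, "refreshes")
```

A real integration would also have to move evicted chunks between GPU and host memory, which is the job of the swap engine that was not part of the release.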

The release is MIT-licensed with trained weights on Hugging Face. The production swap engine remains internal; what shipped is the retriever weights and a toy inference loop that makes the control flow concrete for builders evaluating integration. Teams running DeepSeek-V4 at long context should benchmark their workloads against the released weights before their next infrastructure procurement cycle.

Source: Liberty Wing on GitHub (github.com/libertywing/FlashMemory-Deepseek-V4), published 2026-06-09.
