A lightweight retriever released on GitHub lets DeepSeek-V4 run long-context inference while keeping only 10 to 15 percent of the KV cache on GPU, with the rest offloaded to CPU memory or disk. Benchmarks from the release show the approach matching or beating full attention on RULER, LongMemEval, and LongBench V2 at context lengths from 64K to 500K tokens.
The technique, called FlashMemory, targets DeepSeek-V4’s compressed sparse attention (CSA) mechanism. Every 64 decode steps, a small predictor scores all cached key-value chunks and keeps only the top scorers resident on GPU. For million-token contexts, the repository documentation estimates this is the difference between needing eight GPUs and needing one.
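The refresh cycle described above can be illustrated with a minimal sketch. This is not the repository's code: the names `score_chunks`, `select_resident`, and `decode` are hypothetical, the dot-product scorer is only a stand-in for the trained predictor, and the 12.5 percent budget is an assumed midpoint of the reported 10 to 15 percent range. Only the 64-step interval and the keep-the-top-scorers rule come from the description.

```python
import random

REFRESH_INTERVAL = 64   # decode steps between re-scoring passes (from the article)
GPU_FRACTION = 0.125    # assumption: midpoint of the reported 10-15 percent range


def score_chunks(query, chunk_summaries):
    """Hypothetical stand-in for the trained predictor.

    Scores each cached chunk by the dot product of the current query
    with a per-chunk summary vector.
    """
    return [sum(q * k for q, k in zip(query, summary))
            for summary in chunk_summaries]


def select_resident(scores, fraction=GPU_FRACTION):
    """Return the indices of the top-scoring chunks to keep on GPU."""
    budget = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:budget])


def decode(num_steps, chunk_summaries, dim, seed=0):
    """Toy decode loop: re-select resident chunks every REFRESH_INTERVAL steps."""
    rng = random.Random(seed)
    resident = set()
    refreshes = 0
    for step in range(num_steps):
        # Placeholder for the real decoder's query at this step.
        query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        if step % REFRESH_INTERVAL == 0:
            scores = score_chunks(query, chunk_summaries)
            resident = select_resident(scores)
            refreshes += 1
        # A real engine would attend over the resident chunks here and
        # swap the non-resident ones to CPU memory or disk.
    return resident, refreshes
```

Calling `decode(200, summaries, dim=8)` with 80 chunk summaries re-scores at steps 0, 64, 128, and 192 and keeps 10 chunks resident each time. The sketch leaves out the actual swapping, which the release notes say is handled by a production engine that was not published.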
The release is MIT-licensed with trained weights on Hugging Face. The production swap engine remains internal; what shipped is the retriever weights and a toy inference loop that makes the control flow concrete for builders evaluating integration. Teams running DeepSeek-V4 at long context should benchmark their workloads against the released weights before their next infrastructure procurement cycle.
Source: Liberty Wing on GitHub (github.com/libertywing/FlashMemory-Deepseek-V4), published 2026-06-09.