MiniMax teases M3 with sparse attention that runs 15.6x faster at long context

MiniMax, the Chinese AI lab, published a technical report on May 29 covering its M2 series of language models and previewing a new sparse attention mechanism for its upcoming M3 family that delivers up to 15.6x faster decode speeds at long context lengths. VentureBeat covered the release.

The economic argument MiniMax is making is specific. Long-context inference is the dominant cost driver for production agent deployments, because each new turn appends to a context that every newly generated token has to attend over in full. With dense attention, that cost compounds rapidly: attending over a 100K-token context costs about 10x more per token than attending over a 10K-token one, and about 100x more in total across the whole context, which is where the quadratic label comes from. Sparse attention reduces that cost by structurally limiting which tokens each query attends to, trading off some attention coverage for substantially lower compute.
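As a rough check on that scaling, the arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope model, not MiniMax's mechanism: it counts query-key pairs only, and the function names and the 4,096-key budget are hypothetical values chosen for illustration. The report's actual sparsity pattern is not described here.

```python
def dense_keys_per_token(context_len: int) -> int:
    """Keys that one new query attends to under dense attention."""
    return context_len


def dense_pairs_total(context_len: int) -> int:
    """Query-key pairs for a causal pass over the full context: 1 + 2 + ... + n."""
    return context_len * (context_len + 1) // 2


def capped_keys_per_token(context_len: int, budget: int) -> int:
    """Keys per query when each query is limited to `budget` keys.

    `budget` is an assumed, illustrative cap, not a figure from MiniMax.
    """
    return min(context_len, budget)


if __name__ == "__main__":
    short_ctx, long_ctx = 10_000, 100_000
    budget = 4_096  # hypothetical sparse budget

    per_token_ratio = dense_keys_per_token(long_ctx) / dense_keys_per_token(short_ctx)
    total_ratio = dense_pairs_total(long_ctx) / dense_pairs_total(short_ctx)
    sparse_saving = dense_keys_per_token(long_ctx) / capped_keys_per_token(long_ctx, budget)

    print(f"dense, per new token: {per_token_ratio:.1f}x more keys at 100K vs 10K")
    print(f"dense, whole context: {total_ratio:.1f}x more query-key pairs")
    print(f"capped at {budget} keys: {sparse_saving:.1f}x fewer keys per token at 100K")
```

The reduction in attended keys is not a decode-speed figure. Real decode throughput also depends on memory bandwidth, the cost of selecting which tokens to keep, and the non-attention layers, so a ratio like this should not be read as a prediction of the 15.6x claim.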

The 15.6x figure is the headline. That’s a meaningful improvement for any workflow that holds long context across many turns: research agents that retain a working document, coding agents that hold a codebase context, and customer-service agents that retain conversation history are all in scope.

The skeptical read: MiniMax is teasing a future model, not shipping one today. The 15.6x figure comes from the M3 family’s preview specifications, not from a model anyone outside MiniMax can run yet, and it is an "up to" number for decode speed at long context lengths. Sparse attention is also a well-studied technique with known tradeoffs. Whether MiniMax’s specific implementation preserves the quality characteristics that long-context applications rely on (retrieval accuracy, in-context learning) at the claimed speed is the question that will determine real-world adoption.

The strategic frame is worth naming. We covered DeepSeek’s $10 trillion grand strategy on May 27, framing the Chinese AI infrastructure ambition as a coordinated industrial play. MiniMax’s M3 technical preview fits the same pattern: aggressive efficiency claims targeting the agent-economics constraint that Western flagship models have been slower to address. The Chinese AI lab ecosystem is competing on cost-per-useful-output, not on benchmark headline scores.

For teams running agent workloads with long context, M3 is worth tracking until it becomes available to run and test. The architectural direction (sparse attention for production agent economics) is the trend worth understanding regardless of which specific implementation ships first.

Reported by VentureBeat on 2026-05-29.
