AI hardware is a memory problem, not a compute problem

The dominant framing for AI hardware competition is wrong. Chips are not racing to deliver more floating-point operations per second. They are racing to move data fast enough that the compute they already have does not sit idle. Category VC published an analysis on May 26 arguing that the AI hardware market is, at its core, a stack of memory problems, and that the companies that understand this will outcompete those still optimizing for raw throughput.

The mechanics are specific. Running a large language model at inference time requires reading and writing the KV cache, the temporary store of attention keys and values that grows with context length. For a model serving long-context requests or multi-turn agentic sessions, the KV cache can consume dozens of gigabytes per concurrent user. At that scale, the bottleneck is not how many operations the chip can perform. It is how fast the chip can read from and write to memory, and how much memory it can address without spilling to slower tiers.
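A rough sizing sketch makes the "dozens of gigabytes" figure concrete. It uses the standard estimate that the cache stores one key and one value vector per layer, per KV head, per token. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values, a 128K-token context) is a hypothetical configuration chosen for illustration, not a description of any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Approximate KV cache size in bytes for a single sequence."""
    # Factor of 2: one key vector and one value vector per head, per layer, per token.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return per_token * seq_len


# Hypothetical model shape (assumption for illustration only).
size = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"{size / 1024**3:.1f} GiB per sequence")
```

Under these assumptions a single 128K-token sequence holds 40 GiB of cache, half the HBM of an 80 GB accelerator before any weights are loaded.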

HBM (high-bandwidth memory) stacks are the current answer. Nvidia’s H100 carries 80 GB of HBM3 at roughly 3.35 terabytes per second of memory bandwidth. The Blackwell generation pushes this further. But model architecture is not standing still while hardware refreshes on its 18-to-24-month cycle. Mixture-of-Experts architectures like DeepSeek V4 activate only a fraction of parameters per token, which restructures the memory access pattern. Long-context models pay twice: attention compute grows quadratically with sequence length during prefill, and the KV cache that must be read on every decode step grows linearly with it. Agentic systems running multi-step tool calls hold context across many turns, keeping large KV caches live for minutes rather than milliseconds. Each of these shifts changes the shape of the bottleneck before the silicon designed for the previous shape reaches volume production.
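A back-of-envelope comparison shows why bandwidth, not arithmetic, tends to bind during decoding. The sketch assumes batch size 1, that every decode step streams the active weights and the KV cache from HBM once, and that each active parameter costs about 2 floating-point operations per token. The 3.35 TB/s figure comes from the text above; the 30-billion-parameter model, the 10 GB cache, and the round 1e15 FLOP/s peak are illustrative assumptions, not vendor specifications.

```python
def decode_step_bounds(weight_bytes, kv_bytes, active_params, mem_bw, peak_flops):
    """Lower bounds (in seconds) on one decode step at batch size 1."""
    # Memory bound: active weights and KV cache are each read once per token.
    memory_s = (weight_bytes + kv_bytes) / mem_bw
    # Compute bound: about 2 FLOPs (multiply + add) per active parameter per token.
    compute_s = 2 * active_params / peak_flops
    return memory_s, compute_s


# Hypothetical workload: 30B active parameters in 16-bit, 10 GB of live KV cache.
memory_s, compute_s = decode_step_bounds(
    weight_bytes=30e9 * 2,
    kv_bytes=10e9,
    active_params=30e9,
    mem_bw=3.35e12,      # bytes/s, the H100 figure quoted above
    peak_flops=1e15,     # FLOP/s, an assumed round number
)
bound = "memory" if memory_s > compute_s else "compute"
print(f"memory: {memory_s * 1e3:.2f} ms, compute: {compute_s * 1e3:.3f} ms, bound: {bound}")
```

Under these assumptions the memory floor is a few hundred times the compute floor. Batching raises arithmetic intensity by amortizing weight reads across requests, but the per-request KV cache traffic does not amortize, which is why long contexts keep the workload memory-bound.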

This is the structural mismatch that the hardware industry has not yet resolved. Model architecture cycles now run at roughly three to six months. A lab can ship a new architecture that fundamentally changes memory access patterns faster than a chip vendor can respond in silicon. Nvidia, Cerebras, Groq, AMD, and the hyperscaler custom programs (Google’s TPU, AWS Trainium, Microsoft Maia) are all, to varying degrees, betting on a specific bottleneck shape when they tape out a chip. If the bottleneck shifts between tapeout and volume deployment, the chip arrives partially optimized for a problem that has moved.

Cerebras and Groq built architectures specifically around memory locality and fast memory access, trading off flexibility for throughput on specific workloads. That bet looks reasonable when the workload is stable. It looks fragile when the workload is MoE inference one quarter and long-context agentic orchestration the next. AMD’s MI300X ships with 192 GB of HBM3, a capacity bet rather than a bandwidth bet, which positions it differently but does not escape the same tradeoff.

The SpaceX and Anthropic data center contract we covered last week provides context here. Anthropic committed to roughly 500 megawatts of capacity with a 90-day cancellation clause, at a premium rate. The short exit window is not standard enterprise procurement language. It is an explicit acknowledgment that the workload assumptions baked into today’s infrastructure choices may not hold at deployment time. Anthropic, which is building the models, structured its own compute commitment to preserve optionality. That is a signal worth reading carefully.

The Category VC analysis frames this as a market positioning question for hardware companies, but the consequence flows downstream to enterprise infrastructure decisions being made right now. A team choosing between GPU SKUs or negotiating a multi-year reserved instance commitment is implicitly betting on which memory constraint will bind their production workload in 2027. HBM capacity, bandwidth, cache hierarchy, and interconnect architecture are not interchangeable. Optimizing for peak efficiency on today’s bottleneck is a coherent strategy only if the bottleneck does not move.

Enterprise infrastructure teams locking in compute contracts in 2026 should weight optionality over peak efficiency: the binding memory constraint at production deployment time will likely differ from the one driving vendor selection today.

Published on Category VC’s writing blog on 2026-05-26.
