The napkin math that turns a GPU spec sheet into per-user cost

A technical post on injuly.in shows how GPU bandwidth, batch size, and KV-cache limits set the real per-user price floor for any AI feature.

Alessandro Benigni

PUBLISHED JUN 16, 2026

Four numbers determine what you can charge for an AI feature: GPU memory bandwidth, peak compute throughput, the model’s active parameter count, and your product’s duty cycle. A technical post published on injuly.in on June 14 works through each of these with enough arithmetic to replace the guesswork most founders substitute for actual cost modeling.

The post’s central claim is blunt: model architecture is nearly irrelevant to per-user cost, except for architectures that are fundamentally different (diffusion models being the example given). For standard transformer-based LLMs, the math is the same whether you are running Gemma, Qwen, or DeepSeek. What moves the number is not which model you pick but how efficiently your inference engine saturates the hardware.

Why memory bandwidth is the binding constraint

On a Blackwell-class GPU such as the NVIDIA B200, compute throughput (4,500 TFLOP/s) outpaces memory bandwidth (8 TB/s) by a ratio of about 562 floating-point operations per byte loaded. That ratio is not a curiosity; it is the core design pressure behind every inference optimization. A GPU that can perform 562 operations for each byte it loads will leave its compute units idle unless the batch is large enough to keep both resources busy at once.
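The ratio above is plain division, and a minimal Python sketch makes the comparison explicit. Only the two spec-sheet figures come from the text; the function names are illustrative, and the intensity of 2 FLOPs per byte for a single unbatched request is an assumption (one multiply-add per one-byte weight), not a number from the post.

```python
def ridge_point_flops_per_byte(peak_tflops: float, bandwidth_tb_s: float) -> float:
    """FLOPs the GPU can execute for each byte it can load from memory."""
    return (peak_tflops * 1e12) / (bandwidth_tb_s * 1e12)


def is_memory_bound(workload_flops_per_byte: float, ridge: float) -> bool:
    """A workload below the ridge point waits on memory rather than compute."""
    return workload_flops_per_byte < ridge


# Spec-sheet figures quoted in the article for a B200.
b200_ridge = ridge_point_flops_per_byte(peak_tflops=4500, bandwidth_tb_s=8)

# Assumption: a single unbatched request does about 2 FLOPs
# (one multiply and one add) per one-byte weight it loads.
single_request_intensity = 2.0

print(f"ridge point: {b200_ridge:.1f} FLOPs per byte")
print("single request memory-bound:", is_memory_bound(single_request_intensity, b200_ridge))
```

Batching raises the FLOPs done per byte of weights loaded, which is why batch size is the lever for approaching the ridge point.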

The post works out that the optimal batch size to fully saturate a single B200 is 331 concurrent conversations. In practice, VRAM limits you to roughly six simultaneous full-context (200k token) users before memory runs out. The gap between 331 and 6 is where inference engineering lives.

Three optimizations that close that gap

The post identifies three techniques that move the realistic user count toward the theoretical ceiling.

KV-cache: By storing the key-value matrices computed for each token, inference engines avoid reprocessing the entire conversation history on every forward pass. Without it, generating one token at 200k context requires 26 trillion floating-point operations. With it, the same step drops to roughly 52 million. The compute profile flips from compute-bound to memory-bound, which is exactly what you want.
Grouped-Query Attention (GQA): A standard 32B model at 200k context would need 210 GB of VRAM just for the KV cache, far more than a single GPU holds. GQA shares key-value heads across multiple query heads, cutting that footprint by about 8x to roughly 26 GB per active conversation. That is what makes 40-60 concurrent users per chip viable at realistic context lengths.
PagedAttention (vLLM’s implementation): Most users never consume their full advertised context window. Median conversation length in chat-style products tends to fall between 4k and 40k tokens, not 200k. PagedAttention allocates KV cache in pages matched to actual usage, letting cold or abandoned sessions release memory so other users can proceed. The post puts realistic concurrent capacity at 300 to 800 users per Blackwell chip for a standard ChatGPT-style app, where users spend roughly 80 percent of session time reading rather than prompting.
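The KV-cache footprint figures can be approximated with a short Python sketch. The model shape below (64 layers, 64 query heads, head dimension 128, one byte per cached value) is a hypothetical configuration chosen because it lands near the 210 GB and 26 GB figures quoted above; it is not taken from the post. The 192 GB of VRAM and 32 GB of weights are likewise assumptions.

```python
import math


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache for one conversation (factor 2 = keys plus values)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def users_that_fit(vram_gb: float, weights_gb: float, kv_gb_per_user: float) -> int:
    """Whole conversations that fit in the VRAM left after loading the weights."""
    return math.floor((vram_gb - weights_gb) / kv_gb_per_user)


# Hypothetical model shape (assumption, not from the post).
LAYERS, QUERY_HEADS, HEAD_DIM, BYTES_PER_VALUE = 64, 64, 128, 1
CONTEXT_TOKENS = 200_000
GQA_SHARING = 8  # query heads per shared key-value head

# Without GQA, every query head has its own key-value head.
mha_gb = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, QUERY_HEADS,
                        HEAD_DIM, BYTES_PER_VALUE) / 1e9
# With GQA, the number of key-value heads shrinks by the sharing factor.
gqa_gb = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, QUERY_HEADS // GQA_SHARING,
                        HEAD_DIM, BYTES_PER_VALUE) / 1e9

# Assumptions: 192 GB of VRAM, 32 GB of weights (32B parameters at one byte each).
full_context_users = users_that_fit(vram_gb=192, weights_gb=32, kv_gb_per_user=gqa_gb)

print(f"without GQA: {mha_gb:.1f} GB, with GQA: {gqa_gb:.1f} GB")
print("full-context users per GPU:", full_context_users)
```

Under these assumptions the sketch reproduces the roughly 8x reduction and the single-digit count of full-context users; shorter real-world contexts shrink `tokens` and raise the count accordingly.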

The dollar number

Renting a B200 at $3 per hour and serving 500 users per chip yields approximately $0.006 per user per hour, or $4.32 per user per month (at 720 hours per month) in operating costs. That is your pricing floor for the inference line item alone. The post notes this excludes datacenter overhead and assumes a conversational app with a low duty cycle; agentic loops that keep the GPU busy continuously push the number up by an order of magnitude or more.
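The dollar arithmetic can be written as a small Python sketch. The $3 hourly rate and 500 users per chip come from the text; the 30-day month and the 50-users-per-chip agentic scenario are assumptions added here for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: a 30-day month, GPU rented around the clock


def cost_per_user_hour(gpu_usd_per_hour: float, users_per_gpu: int) -> float:
    """Hourly GPU rental split evenly across the users one chip serves."""
    return gpu_usd_per_hour / users_per_gpu


def cost_per_user_month(gpu_usd_per_hour: float, users_per_gpu: int) -> float:
    """Monthly inference cost floor per user."""
    return cost_per_user_hour(gpu_usd_per_hour, users_per_gpu) * HOURS_PER_MONTH


# Figures from the article: $3 per hour, 500 users per chip.
chat_floor = cost_per_user_month(gpu_usd_per_hour=3.0, users_per_gpu=500)

# Hypothetical agentic workload: a higher duty cycle leaves room for
# only 50 users per chip (illustrative number, not from the post).
agent_floor = cost_per_user_month(gpu_usd_per_hour=3.0, users_per_gpu=50)

print(f"chat-style floor: ${chat_floor:.2f} per user per month")
print(f"agentic floor:    ${agent_floor:.2f} per user per month")
```

The only input that changes between the two scenarios is users per chip, which is where duty cycle enters the price floor.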

The post does not discuss multi-GPU load balancing, noting it ran out of napkin space at that point. The math holds for single-GPU deployments and single-model workloads; production clusters add complexity that changes the equations.

What this means for builders pricing an AI feature

The takeaway is not a specific price point but a method. Before you set a subscription tier, pull two numbers from the GPU spec sheet (bandwidth and throughput), confirm your median conversation length from logs, and estimate your product’s duty cycle. Those four inputs yield a defensible floor. Founders who skip this step and set pricing by comparison to competitors are often surprised when usage patterns shift, specifically when agents replace users and duty cycles climb toward 100 percent.

Source: injuly.in, technical post “Inference cost at scale with napkin math,” published June 14, 2026.
