Four numbers set the floor on what you can charge for an AI feature: GPU memory bandwidth, peak compute throughput, the model’s active parameter count, and your product’s duty cycle. A technical post published last week on injuly.in works through each of these with enough arithmetic to replace the guesswork most founders substitute for actual cost modeling.

The post’s central claim is blunt: model architecture is nearly irrelevant to per-user cost, except for architectures that are fundamentally different (diffusion models being the example given). For standard transformer-based LLMs, the math is the same whether you are running Gemma, Qwen, or DeepSeek. What moves the number is not which model you pick but how efficiently your inference engine saturates the hardware.

Why memory bandwidth is the binding constraint

On a Blackwell-class GPU such as the NVIDIA B200, peak compute throughput (4,500 TFLOP/s) outpaces memory bandwidth (8 TB/s) by roughly 562 floating-point operations per byte loaded. That ratio is not a curiosity; it is the core design pressure behind every inference optimization. A GPU that can perform about 562 operations for each byte it loads will leave its compute units idle unless the batch is large enough to keep both resources busy at once.
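As a quick check on that ratio, the division can be sketched in a few lines of Python. The function name is illustrative; the two inputs are the B200 figures quoted above.

```python
def flops_per_byte(peak_tflops: float, bandwidth_tb_per_s: float) -> float:
    """Peak floating-point operations available per byte read from memory."""
    flops_per_s = peak_tflops * 1e12          # TFLOP/s -> FLOP/s
    bytes_per_s = bandwidth_tb_per_s * 1e12   # TB/s -> bytes/s
    return flops_per_s / bytes_per_s


# Figures quoted in the article for the B200.
ratio = flops_per_byte(peak_tflops=4500, bandwidth_tb_per_s=8)
print(f"{ratio:.1f} FLOPs per byte")  # 562.5 FLOPs per byte
```

Any workload that does less arithmetic per byte than this is waiting on memory, not on compute.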

The post works out that the optimal batch size to fully saturate a single B200 is 331 concurrent conversations. In practice, VRAM limits you to roughly six simultaneous full-context (200k token) users before memory runs out. The gap between 331 and 6 is where inference engineering lives.
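The VRAM ceiling can be sketched the same way: whatever memory is left after loading the weights is divided among per-user key-value caches. Every input below is a hypothetical placeholder, not a figure from the post. The capacity, weight footprint, and per-token cache size are assumptions picked only to show the shape of the calculation.

```python
def max_full_context_users(
    vram_gb: float,
    weights_gb: float,
    kv_bytes_per_token: int,
    context_tokens: int,
) -> int:
    """How many full-context users fit in the VRAM left over after the weights."""
    free_bytes = (vram_gb - weights_gb) * 1e9
    per_user_bytes = kv_bytes_per_token * context_tokens
    return int(free_bytes // per_user_bytes)


# Hypothetical inputs (assumptions, not taken from the post):
# 192 GB of VRAM, 40 GB of weights, 120 kB of KV cache per token.
users = max_full_context_users(
    vram_gb=192,
    weights_gb=40,
    kv_bytes_per_token=120_000,
    context_tokens=200_000,
)
print(users)  # 6
```

With inputs in this range, a handful of full-context users exhausts memory long before the batch is large enough to saturate compute.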

Three optimizations that close that gap

The post identifies three techniques that move the realistic user count toward the theoretical ceiling.

The dollar number

Renting a B200 at $3 per hour and serving 500 users per chip yields approximately $0.006 per user per hour, or $4.32 per user per month (over a 720-hour month) in operating costs. That is your pricing floor for the inference line item alone. The post notes this excludes datacenter overhead and assumes a conversational app with a low duty cycle; agentic loops that keep the GPU busy continuously push the number up by an order of magnitude or more.
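The same arithmetic as a small Python sketch. The rental price and users-per-chip figure are the ones quoted above; the 30-day month and the second scenario, in which a heavier duty cycle leaves room for only 50 users per chip, are assumptions for illustration.

```python
HOURS_PER_MONTH = 24 * 30  # assumes a 30-day month


def cost_per_user(gpu_usd_per_hour: float, users_per_chip: int) -> tuple[float, float]:
    """Return (USD per user per hour, USD per user per month) for one GPU."""
    hourly = gpu_usd_per_hour / users_per_chip
    return hourly, hourly * HOURS_PER_MONTH


# Figures quoted in the article: $3/hour rental, 500 users per chip.
hourly, monthly = cost_per_user(gpu_usd_per_hour=3.0, users_per_chip=500)
print(f"${hourly:.3f}/user/hour, ${monthly:.2f}/user/month")

# Hypothetical heavier workload: ten times fewer users fit on each chip.
_, heavy_monthly = cost_per_user(gpu_usd_per_hour=3.0, users_per_chip=50)
print(f"${heavy_monthly:.2f}/user/month")
```

Because cost scales inversely with users per chip, a tenfold drop in how many users share a GPU raises the floor tenfold.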

The post does not discuss multi-GPU load balancing, noting it ran out of napkin space at that point. The math holds for single-GPU deployments and single-model workloads; production clusters add complexity that changes the equations.

What this means for builders pricing an AI feature

The takeaway is not a specific price point but a method. Before you set a subscription tier, pull two numbers from the GPU spec sheet (memory bandwidth and peak compute throughput), confirm your median conversation length from logs, and estimate your product’s duty cycle. Those four inputs yield a defensible floor. Founders who skip this step and set pricing by comparison to competitors are often surprised when usage patterns shift, particularly when agents replace human users and duty cycles climb toward 100 percent.

Source: injuly.in, technical post “Inference cost at scale with napkin math,” published June 14, 2026.