Memory bandwidth, not peak compute, limits throughput for many neural network layers. A Hugging Face engineering post published June 10 makes that concrete by tracing a GeGLU MLP through three execution modes: eager PyTorch, torch.compile, and a hand-tuned Liger kernel.

The key finding is structural. In eager mode, a three-layer MLP produces five GPU kernels. Two of them, GeLU and the elementwise multiply, write a 50 MB intermediate tensor to high-bandwidth memory and immediately read it back. Fusing those two ops into one Triton kernel eliminates that round trip entirely. The intermediate stays in registers. Bandwidth cost disappears.

torch.compile produces the same fusion automatically. The Liger kernel from the Hugging Face Hub bakes it in without compile latency or Dynamo retracing costs when input shapes change.

The post is part two of a profiling series aimed at engineers who want to read PyTorch profiler traces rather than treat the framework as a black box. For teams tuning inference at scale, the takeaway is straightforward: before adding hardware, check whether intermediate activations are making unnecessary trips through global memory.

Hugging Face engineering blog (huggingface.co/blog), published June 10, 2026.