Kernel fusion is where PyTorch inference speed actually hides

A Hugging Face engineering walkthrough shows how fusing MLP ops eliminates costly memory round trips that raw compute numbers never reveal.

Alessandro Benigni

PUBLISHED JUN 13, 2026

Memory bandwidth, not peak compute, limits throughput for many neural network layers. A Hugging Face engineering post published June 10 makes that concrete by tracing a GeGLU MLP through three execution modes: eager PyTorch, torch.compile, and a hand-tuned Liger kernel.

The key finding is structural. In eager mode, a three-layer MLP produces five GPU kernels. Two of them, GeLU and the elementwise multiply, are coupled by a 50 MB intermediate tensor: GeLU writes it to high-bandwidth memory and the multiply immediately reads it back. Fusing those two ops into one Triton kernel eliminates that round trip entirely. The intermediate stays in registers, and the bandwidth cost of writing and rereading it disappears.
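To put that round trip in numbers, the sketch below counts the bytes each version moves through global memory. It is a back-of-the-envelope model, not a measurement from the post: it assumes the gate, up, and output tensors are all the same size as the 50 MB intermediate, it ignores caches, and the function name `geglu_traffic_mb` is made up for illustration.

```python
def geglu_traffic_mb(tensor_mb: float) -> dict:
    """Estimate global-memory traffic for GeLU followed by an elementwise multiply.

    Assumption: every tensor involved has the same size (tensor_mb) and
    nothing is served from cache.
    """
    # Unfused: two kernels, so the GeLU output is written and then read back.
    unfused = (
        tensor_mb    # GeLU reads the gate projection
        + tensor_mb  # GeLU writes the intermediate
        + tensor_mb  # multiply reads the intermediate back
        + tensor_mb  # multiply reads the up projection
        + tensor_mb  # multiply writes the result
    )
    # Fused: one kernel, and the GeLU output never leaves registers.
    fused = (
        tensor_mb    # read the gate projection
        + tensor_mb  # read the up projection
        + tensor_mb  # write the result
    )
    return {"unfused_mb": unfused, "fused_mb": fused, "saved_mb": unfused - fused}


traffic = geglu_traffic_mb(50.0)
print(traffic)
```

Under these assumptions the pair of ops drops from 250 MB of traffic to 150 MB. The 100 MB saved is exactly the write and the read of the intermediate.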

torch.compile produces the same fusion automatically. The Liger kernel from the Hugging Face Hub ships the fusion prebuilt, so it avoids both compile latency and the cost of Dynamo retracing when input shapes change.
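A minimal sketch of what such an MLP might look like in PyTorch, and how it would be handed to torch.compile. The class name `GeGLUMLP`, the layer names, the sizes, and the tanh form of GeLU are assumptions for illustration, not taken from the Hugging Face post. The five commented ops are presumably the five eager-mode kernels the post counts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeGLUMLP(nn.Module):
    """Illustrative GeGLU MLP; names and sizes are hypothetical."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down_proj = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.gate_proj(x)                # 1. matmul
        up = self.up_proj(x)                    # 2. matmul
        act = F.gelu(gate, approximate="tanh")  # 3. GeLU (tanh form is an assumption)
        gated = act * up                        # 4. elementwise multiply
        return self.down_proj(gated)            # 5. matmul


def compile_mlp(mlp: nn.Module):
    # torch.compile is lazy: tracing and code generation happen on the
    # first call of the returned module, not here.
    return torch.compile(mlp)


torch.manual_seed(0)
mlp = GeGLUMLP(hidden_size=16, intermediate_size=32)
x = torch.randn(4, 16)
eager_out = mlp(x)  # eager mode: each op above runs as its own kernel
```

Calling `compile_mlp(mlp)(x)` should return the same values as `eager_out` up to floating-point tolerance. The first call pays the compile latency mentioned above, and a call with a new input shape may trigger a retrace.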

The post is part two of a profiling series aimed at engineers who want to read PyTorch profiler traces rather than treat the framework as a black box. For teams tuning inference at scale, the takeaway is straightforward: before adding hardware, check whether intermediate activations are making unnecessary trips through global memory.
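As a starting point for reading such traces, the sketch below profiles just the GeLU and multiply steps with `torch.profiler`. It records CPU activity only so that it runs without a GPU. On a CUDA machine you would move the tensors to the device and add `ProfilerActivity.CUDA` to see the individual kernels. The helper name `profile_geglu_ops` and the tensor sizes are made up for illustration.

```python
import torch
import torch.nn.functional as F
from torch.profiler import ProfilerActivity, profile


def profile_geglu_ops(tokens: int = 64, width: int = 32):
    """Profile an unfused GeLU followed by an elementwise multiply."""
    gate = torch.randn(tokens, width)
    up = torch.randn(tokens, width)
    with profile(activities=[ProfilerActivity.CPU]) as prof:
        act = F.gelu(gate, approximate="tanh")  # materializes the intermediate
        _ = act * up                            # reads the intermediate back
    return prof


prof = profile_geglu_ops()
# Each eager op shows up under its ATen operator name.
op_names = {event.key for event in prof.key_averages()}
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

In eager mode the GeLU and the multiply appear as separate rows in the table. That separation is the signature of an intermediate being written out and read back.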

Hugging Face engineering blog (huggingface.co/blog), published June 10, 2026.
