The transformer monoculture is over. Here is what replaced it.

Modern frontier models pack half a dozen attention variants, MoE routing, and multi-GPU comms into a single forward pass. "LLM" now hides a lot.

Alessandro Benigni

PUBLISHED JUN 22, 2026

Pick any two frontier open-weight models released a year apart and diff their architecture diagrams. The gap looks less like an update and more like two different species. That gap is the real story behind today’s LLM benchmark race, and it has direct consequences for anyone reasoning about cost, context length, or behavioral consistency across models.

Ian Barber, writing on his personal blog ianbarber.blog this week, makes the point plainly: the 2017 “attention is all you need” transformer was a clean, regular stack of repeated modules. Llama 3 still has that legible scaffold. Nvidia’s Nemotron 3 Ultra does not, and the difference is obvious when you place the two side by side in Sebastian Raschka’s LLM architecture gallery.

The attention layer alone now splinters into several distinct variants. Grouped-query attention (GQA) shrinks the key-value cache at inference by sharing each key-value head across a group of query heads, which is part of what makes Llama 3 8B practical on a consumer GPU. Sliding-window attention constrains each token to attend only within a fixed local window, making long-context processing cheaper. Linear attention replaces the softmax with a kernel feature map, an approximation that cuts the cost in sequence length from quadratic to linear. Sparse attention drops most token pairs entirely, routing computation only where it is likely to matter. These are not cosmetic choices: each variant changes where the model spends its compute budget, which shapes latency, throughput, and what context lengths are economically viable.
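
To make the scaling difference concrete, here is a minimal sketch that counts how many query-key pairs a causal model scores under full attention versus a sliding window. The function names and the 4,096-token window are assumptions chosen for illustration, not figures from Barber's post or from any particular model.

```python
def causal_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full causal attention: token i sees i + 1 keys."""
    return n_tokens * (n_tokens + 1) // 2


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Pairs scored when each token sees at most `window` most recent keys."""
    return sum(min(i + 1, window) for i in range(n_tokens))


if __name__ == "__main__":
    window = 4096  # hypothetical window size
    for n in (4096, 32768, 131072):
        full = causal_pairs(n)
        local = sliding_window_pairs(n, window)
        print(f"{n:>7} tokens: full={full:,} windowed={local:,} ratio={full / local:.1f}x")
```

Full attention grows with the square of the context length, while the windowed count grows linearly once the context exceeds the window, which is the economic argument for the variant.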

Beyond attention, Mixture-of-Experts (MoE, a technique that activates only a subset of specialized sub-networks per token rather than every parameter in the layer) has now spread from the feed-forward layer to attention blocks and even the residual stream. Vision and audio encoders, once grafted on after training, are now fused into the forward pass itself. And because large models now run across multiple GPUs at inference time, collective communication operations (the coordination traffic between GPUs) appear as explicit nodes inside the model graph, adding synchronization boundaries that affect token latency at scale.
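
The routing idea can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the experts are scalar functions, the router logits are given rather than learned, and renormalizing the weights over the selected experts is one common convention rather than the only one. All names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    return sum(weight * experts[i](x) for i, weight in route_top_k(router_logits, k))


if __name__ == "__main__":
    # Four toy "experts"; a real layer would use feed-forward sub-networks.
    experts = [lambda x: x, lambda x: 2 * x, lambda x: x * x, lambda x: -x]
    logits = [0.0, 2.0, 1.0, -1.0]  # hypothetical router scores for one token
    print(route_top_k(logits, k=2))
    print(moe_layer(3.0, logits, experts, k=2))
```

Only two of the four experts execute for this token, which is where the compute saving comes from and also why different inputs can exercise different parts of the model.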

Barber draws a parallel to recommendation systems, which went through the same arc. For a decade, the dominant architecture was a two-tower sparse neural net. It was legible. Then the gap between performance as an optimization and performance as a necessity shrank to nothing, and the architecture exploded in complexity to stay competitive. The baseline became the bottleneck.

The practical consequence for builders is this: you cannot port a system from one model to another and assume equivalent behavior at equivalent throughput. A model using linear attention handles long contexts differently from one using standard attention, even if both advertise the same context length. A model routing through MoE activates different expert subsets for different input domains, which can produce uneven accuracy on niche tasks. If you are evaluating two models against a cost-per-token target, the attention variant is a first-order variable, not an implementation detail.
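
A back-of-envelope calculation shows why the attention variant belongs in a cost model. The sketch below estimates key-value cache size per sequence. The layer count, head counts, and head dimension are the values commonly reported for Llama 3 8B and are used here as illustrative assumptions; the function name and the 16-bit cache are also assumptions.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2, batch=1):
    """Bytes needed to cache keys and values (hence the factor of 2) for one batch."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch


if __name__ == "__main__":
    GIB = 2**30
    # Assumed config: 32 layers, 32 query heads, head dimension 128, 16-bit cache.
    seq_len = 8192
    mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=seq_len)
    gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=seq_len)
    print(f"one KV head per query head: {mha / GIB:.1f} GiB")
    print(f"8 shared KV heads (GQA):    {gqa / GIB:.1f} GiB")
```

Under these assumptions, sharing key-value heads four ways cuts the per-sequence cache from 4 GiB to 1 GiB at 8,192 tokens, memory that translates directly into larger batches or longer contexts on the same hardware.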

Barber highlights FlexAttention in PyTorch as a principled response to this proliferation. FlexAttention generates fused attention kernels via Triton templates for a whole class of attention operations, designed from the start to be composable and benchmarkable without heavy manual optimization. It lets a team explore variant B before committing to the engineering cost of fully fusing it. The lesson is not that complexity is avoidable. It is that composability is the only property that keeps a research loop from seizing up as baseline optimizations become load-bearing.
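
As a rough illustration of that composability, the sketch below defines two mask rules and combines them into a third, in the `mask_mod` style that FlexAttention accepts. The window size, tensor shapes, and function names are hypothetical. The `attend` helper assumes a recent PyTorch release that ships `torch.nn.attention.flex_attention` and is defined but not called here.

```python
WINDOW = 128  # hypothetical local window size, chosen for illustration


def causal(b, h, q_idx, kv_idx):
    """A query may attend only to keys at or before its own position."""
    return q_idx >= kv_idx


def sliding_window(b, h, q_idx, kv_idx):
    """A query may attend only to keys fewer than WINDOW positions behind it."""
    return q_idx - kv_idx < WINDOW


def causal_sliding_window(b, h, q_idx, kv_idx):
    """Compose the two rules; `&` works for Python bools and for tensors."""
    return causal(b, h, q_idx, kv_idx) & sliding_window(b, h, q_idx, kv_idx)


def attend(device="cuda"):
    """Run the composed mask through FlexAttention (requires PyTorch; not run here)."""
    import torch
    from torch.nn.attention.flex_attention import create_block_mask, flex_attention

    batch, heads, seq_len, head_dim = 1, 4, 1024, 64  # assumed toy shapes
    q, k, v = (torch.randn(batch, heads, seq_len, head_dim, device=device) for _ in range(3))
    # B=None and H=None broadcast the same mask over batch and heads.
    block_mask = create_block_mask(
        causal_sliding_window, B=None, H=None, Q_LEN=seq_len, KV_LEN=seq_len, device=device
    )
    # Wrapping flex_attention in torch.compile is the usual route to a fused kernel.
    return flex_attention(q, k, v, block_mask=block_mask)


if __name__ == "__main__":
    print(causal_sliding_window(0, 0, 200, 150))  # within the window and causal
    print(causal_sliding_window(0, 0, 200, 10))   # causal but outside the window
```

Swapping in a different variant means changing one small function rather than writing a new kernel, which is the property the paragraph above describes.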

Barber notes that Andrej Karpathy recently joined Anthropic to work on richer automated research loops at the frontier. His point is that architectural composability and agentic research pipelines are not alternatives: you need both. A clever loop cannot generate optimally fused kernels if there is no clean baseline to verify against.

For teams currently selecting a base model for a product shipping in the next ninety days, treat attention variant and MoE configuration as specification items, not footnotes. The architecture determines which hardware is optimal, which context lengths are economical, and how predictably the model generalizes outside its training distribution.

Source: Ian Barber, ianbarber.blog, published June 19, 2026.
