The architecture arms race is a distraction

Every major frontier lab markets its model as an architectural breakthrough. Spend an afternoon reading the actual papers and a different picture appears: GPT, Claude, Gemini, and the leading open-weight families are built on the same transformer skeleton, refined through several years of incremental choices that most teams converged on independently.

That is the core finding in a detailed technical explainer published by 0xkato on June 6. The essay, aimed at developers who use these systems daily, walks through the full machinery from tokenization through next-token prediction. Its conclusion is quiet but consequential: the architectural differences across frontier models are mostly configuration choices, while the real differentiation lives in training data composition and the post-training stack applied on top of the base weights.

The shared skeleton is real. The standard modern transformer uses pre-norm placement, RMSNorm, Rotary Position Embeddings (RoPE), SwiGLU activations in the feed-forward network, and Grouped-Query Attention. None of these were designed together. They accumulated between 2017 and 2023 as separate papers solved separate problems, and most serious labs converged on all of them. Mixture-of-Experts routing is the one significant structural divergence at the frontier, and even that changes the cost math more than the fundamental logic.
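To make the shared skeleton concrete, here is a minimal pure-Python sketch of four of the pieces named above: RMSNorm, a SwiGLU feed-forward layer, pre-norm residual placement, and the head grouping behind Grouped-Query Attention. It is an illustration under simplifying assumptions (a single token vector, plain lists instead of tensors, no attention computation, no RoPE). All function names are hypothetical and do not come from the 0xkato essay or any lab's code.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: divide by the root-mean-square of the vector (no mean subtraction)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(v):
    """SiLU (swish) activation used inside SwiGLU."""
    return v / (1.0 + math.exp(-v))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def pre_norm_ffn_block(x, gain, w_gate, w_up, w_down):
    """Pre-norm placement: normalize the sublayer input, then add the result to the residual."""
    y = swiglu_ffn(rms_norm(x, gain), w_gate, w_up, w_down)
    return [a + b for a, b in zip(x, y)]

def kv_head_for_query_head(q_head, n_q_heads, n_kv_heads):
    """GQA: consecutive query heads share one key/value head."""
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size
```

Each function is a few lines, which is roughly the point: none of these components is exotic on its own, and swapping one for a variant is a local change rather than a redesign.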

When a lab announces a new model generation, the architecture section of the release is therefore the least informative part. The number of layers, the head count, the parameter total, whether it uses MoE routing: these are configuration variables, not moats.
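One way to see why these are configuration variables is to write them down as a config and count parameters from it. The sketch below is a rough approximation, not any lab's actual accounting: it assumes tied input and output embeddings, ignores biases and MoE router weights, and uses made-up toy values. The `ModelConfig` class and both functions are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfig:
    n_layers: int
    d_model: int
    n_q_heads: int
    n_kv_heads: int          # fewer than n_q_heads means Grouped-Query Attention
    d_ff: int
    vocab_size: int
    n_experts: int = 1       # 1 means a dense feed-forward layer
    experts_per_token: int = 1

def layer_params(cfg, active_only=False):
    """Approximate weights in one transformer layer (biases and router ignored)."""
    head_dim = cfg.d_model // cfg.n_q_heads
    kv_dim = cfg.n_kv_heads * head_dim
    # Query and output projections are d_model x d_model; key and value are d_model x kv_dim.
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * kv_dim
    # SwiGLU uses three matrices per expert: gate, up, and down.
    ffn_one = 3 * cfg.d_model * cfg.d_ff
    n_ffn = cfg.experts_per_token if active_only else cfg.n_experts
    norms = 2 * cfg.d_model  # one RMSNorm gain vector before each sublayer
    return attn + n_ffn * ffn_one + norms

def total_params(cfg, active_only=False):
    """Approximate total weights, assuming tied input/output embeddings."""
    embed = cfg.vocab_size * cfg.d_model
    return embed + cfg.n_layers * layer_params(cfg, active_only)

# Toy values chosen only to keep the arithmetic easy to follow.
dense = ModelConfig(n_layers=2, d_model=8, n_q_heads=4, n_kv_heads=2,
                    d_ff=16, vocab_size=100)
moe = ModelConfig(n_layers=2, d_model=8, n_q_heads=4, n_kv_heads=2,
                  d_ff=16, vocab_size=100, n_experts=4, experts_per_token=2)
```

Comparing `total_params(moe)` with `total_params(moe, active_only=True)` shows the MoE effect in miniature: the stored parameter count grows with the number of experts, while the weights used per token grow only with `experts_per_token`. The skeleton is unchanged; only the cost arithmetic moves.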

Post-training is where the product is built. The 0xkato essay is precise on this point: the same base model, put through different post-training regimes, produces radically different chat behavior. Instruction tuning shapes whether a model follows instructions literally or tries to infer intent. Reinforcement learning from human feedback determines the model’s tolerance for refusals, its preferred output format, and the texture of its reasoning traces. The extended-thinking mode in reasoning models such as Claude and o3 is not an architectural feature. It is a post-training choice that teaches the model to generate longer chain-of-thought traces before committing to an answer.

This has an immediate practical implication for teams making procurement decisions. Comparing two models on generic benchmarks mostly measures the base architecture and the training scale. It does not tell you which model has been post-trained on work that resembles your workflow. A model with mediocre aggregate benchmark scores but deep post-training on legal documents can outperform a higher-ranked general model on legal tasks, and the aggregate benchmarks will not show it.

Three stories this week decode differently once you hold this frame. Google’s Gemma 4 QAT release is primarily a compression story about inference cost, not a new capability story. The Apple arrangement to route Siri queries through a third-party frontier model is a post-training procurement decision: Apple is paying for access to a system that has already been post-trained at a scale Apple has not matched internally. The economics piece on AI subsidies is about who can afford to run the training compute that produces the base weights; it is a prerequisite story, not a capability story.

The benchmark-credibility problem is not going away. Labs have strong incentives to emphasize architectural novelty because it is harder to replicate than data curation or a post-training recipe. When a release announcement leads with attention mechanism variants rather than training data composition, that emphasis is a choice worth noticing. The 0xkato essay cites no independent evals of the proprietary models discussed. That is an honest limitation of any analysis built on public information about systems whose training details are not disclosed.

For builders currently selecting a foundation model for a specialized application, the question to prioritize is not which architecture the model uses. The question is which post-training work the lab has done in your domain, and whether the lab will disclose enough about data composition to make that judgment. Most will not. That gap is the actual competitive landscape.

Source: 0xkato on 0xkato.xyz, published 2026-06-06.
