Goodfire, the San Francisco interpretability lab, published research on May 21 showing that sparse autoencoders (SAEs), the dominant tool for decomposing neural network activations into labeled features, systematically misrepresent the geometry underlying model representations when features are read in isolation.
SAEs work by expressing a network’s internal activations as a sparse weighted combination of learned directions in activation space, with only a few of the many available directions active for any given input. Interpretability researchers at labs including Anthropic have relied on SAEs under the working assumption that each direction, or feature, captures a distinct concept. A direction might activate for “cold weather” or “financial risk,” with its magnitude signaling intensity. Goodfire’s study, authored by researcher Usha Bhalla, challenges how far that assumption holds when the underlying geometry is curved rather than flat.
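The decomposition described above can be sketched in a few lines. This is a minimal illustration assuming a generic ReLU autoencoder, not the architecture used by Goodfire or any other lab; the function names, the hand-picked weights, and the two-dimensional activation space are all assumptions made for the example.

```python
def sae_encode(activation, encoder_rows, encoder_bias):
    """Feature activations f = ReLU(W_enc x + b_enc); many entries end up zero."""
    return [
        max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
        for row, b in zip(encoder_rows, encoder_bias)
    ]


def sae_decode(features, directions, decoder_bias):
    """Reconstruction x_hat = b_dec + sum_i f_i * d_i over learned directions d_i."""
    x_hat = list(decoder_bias)
    for f, direction in zip(features, directions):
        for k, d in enumerate(direction):
            x_hat[k] += f * d
    return x_hat


# Toy, hand-picked weights (hypothetical): three feature directions in a 2-D space.
directions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
encoder_rows = directions          # tied weights, purely for simplicity
encoder_bias = [0.0, 0.0, 0.0]
decoder_bias = [0.0, 0.0]

activation = [2.0, 0.5]
features = sae_encode(activation, encoder_rows, encoder_bias)
reconstruction = sae_decode(features, directions, decoder_bias)
print(features, reconstruction)
```

Each nonzero entry of `features` is the magnitude of one direction; the third feature stays at zero because the activation points away from it. Reading a feature "in isolation" means looking at one such entry and its label without regard to which other entries fire alongside it.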
The core finding is that curved manifolds in neural representations, the kinds of structures that encode concepts like temperature, spatial relations, or semantic categories, are not captured by any single SAE feature. Instead, an SAE spreads each manifold across multiple features, in one of three modes the paper names: shattering, where each point on a curved surface gets its own unique feature; compact capture, where a small set of shared features acts as a rough coordinate system for the whole manifold; and dilution, where a moderate number of partially overlapping features each cover a different region of the surface.
When Goodfire trained an SAE on actual neural network representations, dilution was what they observed in practice. A feature labeled “Cold weather and its effects” fires at one end of a temperature manifold. A feature for extreme heat fires at the other. Neither feature is wrong. Both are local descriptions of something neither alone can name.
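A toy version of dilution makes the pattern concrete. The sketch below is an illustration only, not the paper's construction or data: it assumes a temperature manifold drawn as a half-circle in two dimensions and three hypothetical features (`cold`, `mild`, `hot`), each a direction with a firing threshold.

```python
import math


def point_on_manifold(t):
    """Map t in [0, 1] (cold to hot) to a point on a half-circle."""
    return (math.cos(math.pi * t), math.sin(math.pi * t))


# Hypothetical features: each direction points at one part of the arc.
FEATURES = {
    "cold": point_on_manifold(0.0),
    "mild": point_on_manifold(0.5),
    "hot": point_on_manifold(1.0),
}
BIAS = 0.5  # a feature fires only when the activation aligns well with its direction


def feature_activations(x):
    """ReLU(x . d - BIAS) for each feature direction d."""
    return {
        name: max(0.0, x[0] * d[0] + x[1] * d[1] - BIAS)
        for name, d in FEATURES.items()
    }


positions = [i / 20 for i in range(21)]
active = {
    t: [name for name, a in feature_activations(point_on_manifold(t)).items() if a > 0.0]
    for t in positions
}

for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"t={t:.2f}: {active[t]}")
```

In this toy, every point on the arc activates at least one feature, neighboring features overlap, and the two end features never fire together. No single feature spans the whole arc, which is the sense in which each label is locally right and globally partial.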
Goodfire invokes the parable of blind men and an elephant to describe the problem. Each observer is locally accurate, but each describes a different part, and none of them arrives at “elephant.” A single SAE feature may label a meaningful region of a manifold correctly while leaving the broader structure invisible. Looking at features one by one produces locally accurate labels and globally incomplete understanding.
The proposed fix is an unsupervised pipeline. Rather than interpreting features in isolation, the pipeline collects the firing patterns of many SAE features across many activations, clusters features by statistical dependency in those patterns, and then asks what geometry the combined cluster spans. Applied to Llama 3.1 8B, the pipeline surfaced manifolds that the authors describe as surprisingly rich and specific. Goodfire notes it is developing new architectures tailored to unsupervised manifold discovery, rather than post-processing SAE outputs.
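The three steps of the pipeline (collect firing patterns, cluster features by dependency, inspect the geometry of each cluster) can be sketched roughly as follows. This is not Goodfire's implementation: the dependency measure (a phi correlation on binarized firings), the 0.15 threshold, the connected-components clustering, and the synthetic data are all assumptions chosen to keep the example small.

```python
import math
import random
from itertools import combinations


def phi(a, b):
    """Pearson correlation between two binary firing indicators."""
    n = len(a)
    pa, pb = sum(a) / n, sum(b) / n
    pab = sum(x and y for x, y in zip(a, b)) / n
    denom = math.sqrt(pa * (1 - pa) * pb * (1 - pb))
    return (pab - pa * pb) / denom if denom else 0.0


def cluster_features(firings, threshold=0.15):
    """Link features that co-fire more than chance; return connected components."""
    n_features = len(firings[0])
    fired = [[row[f] > 0.0 for row in firings] for f in range(n_features)]
    parent = list(range(n_features))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(n_features), 2):
        if phi(fired[i], fired[j]) > threshold:
            parent[find(i)] = find(j)

    groups = {}
    for f in range(n_features):
        groups.setdefault(find(f), []).append(f)
    return sorted(groups.values())


def cluster_points(firings, directions, cluster):
    """Reconstruct every sample from one cluster's features only."""
    dim = len(directions[0])
    return [[sum(row[f] * directions[f][k] for f in cluster) for k in range(dim)]
            for row in firings]


# Synthetic data (hypothetical): three arc features tile a "temperature"
# half-circle that is present in about 30% of samples; a fourth feature fires
# independently of them.
rng = random.Random(0)
centers = [0.0, 0.5, 1.0]
directions = [[math.cos(math.pi * c), math.sin(math.pi * c), 0.0] for c in centers]
directions.append([0.0, 0.0, 1.0])

firings = []
for _ in range(4000):
    on_manifold = rng.random() < 0.3
    t = rng.random()
    row = [max(0.0, math.cos(math.pi * (t - c)) - 0.5) if on_manifold else 0.0
           for c in centers]
    row.append(1.0 if rng.random() < 0.3 else 0.0)
    firings.append(row)

clusters = cluster_features(firings)
points = cluster_points(firings, directions, clusters[0])
print(clusters)
```

In this toy, the two end features never co-fire, yet both should land in the same cluster because each overlaps with the middle feature. The resulting `points` are what one would then examine, for example with PCA or a nonlinear embedding, to ask what shape the cluster spans.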
The caveats are substantial. This is a single lab’s study on one open-weight model, not a multi-institution replication across frontier architectures. The results have not been independently validated, and the paper acknowledges that geometry alone is incomplete. Understanding how internal computations over those geometric structures produce model behavior is a separate, unsolved problem.
The practical consequence for teams building on SAE-based interpretability is this: if dilution is the norm, then a feature-by-feature audit of a model’s internal representations may be constructing a confident but partial story. Feature labels that look clean and specific may each be describing one leg of an elephant that the audit never assembles into a whole. Teams using SAE features to draw safety-relevant conclusions about what a model knows or represents should treat those features as starting hypotheses for cluster-level analysis, not as terminal units of meaning.
Published by Goodfire on 2026-05-21.