Goodfire AI killed a language model’s ability to produce German text by fine-tuning on exactly four German tokens, a result the interpretability startup shared on X and documented in detail on LessWrong on June 26, 2026.
The subject was a 67-million-parameter language model. Standard fine-tuning approaches typically require thousands of examples to reliably suppress a behavior across a model. Getting there in four tokens suggests the team found a very precise handle on what controls German specifically.
The method they used is called parameter decomposition: rather than treating a weight matrix as a monolithic block to be adjusted wholesale, the approach splits it into interpretable, sparsely-activating subcomponents. Think of it less like turning down a volume knob and more like identifying which individual instrument in an orchestra is playing the German part and muting only that instrument.
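To make the orchestra analogy concrete, here is a minimal toy sketch of the idea. It assumes, purely for illustration, that each subcomponent is a rank-one matrix and that the weight matrix is their sum; the component names, sizes, and values are hypothetical and are not taken from Goodfire's code.

```python
# Toy sketch: a weight matrix written as a sum of rank-one subcomponents.
# All names and values are illustrative assumptions, not Goodfire's code.

def outer(u, v):
    """Rank-one matrix u v^T as a list of rows."""
    return [[ui * vj for vj in v] for ui in u]

def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def matvec(m, x):
    """Matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def compose(components, scales):
    """Rebuild W as sum_c scales[c] * u_c v_c^T."""
    rows, cols = len(components[0][0]), len(components[0][1])
    w = [[0.0] * cols for _ in range(rows)]
    for (u, v), s in zip(components, scales):
        w = add(w, outer([s * ui for ui in u], v))
    return w

# Three hypothetical subcomponents. Each reads one input direction (v)
# and writes to an output direction (u); the labels are pretend labels.
components = [
    ([1.0, 0.0], [1.0, 0.0, 0.0]),  # pretend label: "general syntax"
    ([0.0, 1.0], [0.0, 1.0, 0.0]),  # pretend label: "German"
    ([1.0, 1.0], [0.0, 0.0, 1.0]),  # pretend label: "other languages"
]

full = compose(components, [1.0, 1.0, 1.0])   # original weights
muted = compose(components, [1.0, 0.0, 1.0])  # mute only the "German" component

german_like_input = [0.0, 1.0, 0.0]  # activates only the "German" component
other_input = [0.0, 0.0, 1.0]        # activates only "other languages"

print(matvec(full, german_like_input), matvec(muted, german_like_input))
print(matvec(full, other_input), matvec(muted, other_input))
```

In this toy setup, zeroing one component's scale removes the response to the input that activates it while leaving the response to the other input unchanged, which is the kind of isolation the analogy describes.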
In practice, the final edit came down to tuning a single scalar factor attached to one German-specific subcomponent of the model’s weights. The team got there in two steps. An initial pass targeted the 16 components most associated with German-language output, but the component labels showed that most of those 16 were associated with foreign languages generally rather than German in particular. Narrowing the target to the single component that was genuinely German-specific improved the precision of the edit.
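A rough sketch of what tuning a single scalar can look like: one trainable gate on one component, everything else frozen, fitted on four examples. The squared-logit objective, the learning rate, and the activation values are assumptions chosen for illustration; the write-up summarized here does not specify the actual training objective.

```python
# Toy sketch: train one scalar gate on one subcomponent; nothing else is trainable.
# The objective, learning rate, and activations are illustrative assumptions.

def german_logit(gate, activation):
    """Hypothetical contribution of the gated component to a German-token logit."""
    return gate * activation

def loss_and_grad(gate, activations):
    """Mean squared German logit and its derivative with respect to the gate."""
    n = len(activations)
    loss = sum(german_logit(gate, a) ** 2 for a in activations) / n
    grad = sum(2.0 * german_logit(gate, a) * a for a in activations) / n
    return loss, grad

# Four made-up component activations standing in for four German tokens.
activations = [1.0, 0.8, 1.2, 0.9]

gate = 1.0            # 1.0 leaves the original model unchanged
learning_rate = 0.2   # hypothetical value
for _ in range(50):
    _, grad = loss_and_grad(gate, activations)
    gate -= learning_rate * grad  # plain gradient descent on the single scalar

final_loss, _ = loss_and_grad(gate, activations)
print(gate, final_loss)
```

Because only one number is being learned, a handful of examples is enough to pin it down in this toy setting; the hard part in practice is the decomposition that identifies which component to gate.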
The comparison against LoRA, the dominant parameter-efficient fine-tuning method, is where the result becomes directly relevant to people building controllable or restricted models. LoRA matched the German-removal performance in aggregate, but with notable collateral damage: French, Spanish, Italian, and in some cases English degraded alongside German. The parameter decomposition approach largely avoided that bleed, leaving the other languages intact. That specificity matters if you are building a product that needs to restrict one behavior without degrading adjacent ones.
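One way to quantify this kind of bleed in your own experiments is to compare per-language loss before and after an edit. The sketch below is a hypothetical helper, not part of Goodfire's release, and the loss values are made-up placeholders rather than results from the post.

```python
# Sketch of a collateral-damage report for a behavior-removal edit.
# The loss values below are made-up placeholders for illustration only;
# they are NOT results from the Goodfire write-up.

def collateral_report(loss_before, loss_after, target):
    """Return (loss increase on the target, {other language: loss increase})."""
    deltas = {lang: loss_after[lang] - loss_before[lang] for lang in loss_before}
    target_delta = deltas.pop(target)
    return target_delta, deltas

def worst_bleed(side_effects):
    """Language with the largest unintended loss increase."""
    return max(side_effects, key=side_effects.get)

# Hypothetical per-language evaluation losses before and after an edit.
before = {"de": 2.0, "fr": 2.1, "es": 2.2, "en": 1.8}
after = {"de": 6.0, "fr": 2.6, "es": 2.3, "en": 1.8}

target_delta, side_effects = collateral_report(before, after, target="de")
print("target increase:", target_delta)
print("worst side effect:", worst_bleed(side_effects))
```

A precise edit should show a large increase on the target language and near-zero increases everywhere else; running the same report for two editing methods gives a like-for-like comparison.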
The broader context here is that surgical, interpretable model editing has been a stated priority for AI safety researchers for years. The practical problem has always been that weight updates are opaque: you can see what changed numerically, but not why the change produced the behavioral outcome it did. Parameter decomposition addresses that opacity directly. By building interpretable, labeled subcomponents, the approach gives the editor a named target rather than a diffuse gradient update distributed across millions of parameters. Whether this scales to frontier-sized models, or to more complex behavioral constraints as opposed to linguistic ones, which are relatively localized, remains the open question.
Goodfire built this during a one-day hackathon while working on their product Silico, an interpretability tooling platform. The fact that the work emerged from a hackathon, rather than a dedicated research program, suggests the method is accessible enough to run quickly once you have the decomposition infrastructure in place.
For teams building fine-tuned models where behavioral specificity matters, this result is worth testing against your own evaluations. If your current fine-tuning approach is producing collateral degradation in capabilities you want to preserve, parameter decomposition may offer a lower-bleed alternative. The code and findings on LessWrong are detailed enough to attempt replication on a comparable small model.
Shared by Goodfire AI on X, with technical details published on LessWrong, on June 26, 2026.