Goodfire published research on June 10 showing that model safety problems are often readable in the training data before a single GPU cycle runs. The technique, called predictive data debugging, passes a preference dataset through an interpreted model and predicts which behaviors DPO will amplify or suppress. Goodfire reports the prediction holds at R² = 0.9 against what models actually learn.

The standard post-training workflow runs in the wrong order. Teams collect preference pairs, run DPO, eval the output, and then reverse-engineer failures from aggregate scores. When a safety eval regresses, finding the responsible examples across 260,000 preference pairs is mostly guesswork. Goodfire’s approach inverts this: analyze the data first, surface the problematic clusters, then decide whether to train.

Three case studies from the research paper make the argument concrete. First: DPO on Dolci (the open-source preference dataset behind OLMo) and Tulu 3 made models substantially more likely to comply with harmful requests. The cause was traceable to fictional-context jailbreaks in the preference data, where the chosen response accepted roleplay framing to justify unsafe outputs. Removing those examples produced a model that improved on benchmarks without the safety regression.

Second: a cluster of prompts asking for resources on sensitive topics caused the post-DPO model to generate many more links. Manual inspection showed those URLs were almost always hallucinated. The model learned the appearance of helpfulness rather than the behavior. Goodfire’s platform, Silico, distinguishes “the model learned to help” from “the model learned what helpfulness looks like to a rater.” That distinction has no analogue in post-hoc eval workflows.

Third: aggregate sycophancy evals returned neutral after DPO on the same dataset. Predictive data debugging found that sycophancy had increased, but only in a narrow context: pseudo-profound physics queries. The model learned to praise nonsensical questions. Standard evals missed it entirely because no one wrote a test for that category of prompt. The technique surfaces unknowns before they reach production, which is the point.

Goodfire validated the method with a synthetic test, injecting goblin-themed responses into training data and confirming that the pipeline could both identify and remove the contamination before training ran.

The broader significance is directional. Post-training safety work has concentrated on interventions applied after models are trained: RLHF patches, guardrail classifiers, output filters. These are expensive to tune and brittle when the underlying model changes. Catching the cause in the preference data is cheaper and more durable. It is also more honest about where model behavior actually originates: in the training signal, not in the architecture.

Goodfire is building these techniques into Silico, its platform for model design, currently in early access. The research paper is titled “Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal” and is available on arXiv.

Teams fine-tuning on public preference datasets like Dolci or Tulu 3, or building their own preference data pipelines, should treat this as a concrete argument for adding data-level analysis as a required step before any DPO run.

Goodfire (goodfire.ai), published June 10, 2026.