Spotify Engineering has published a framework for pairing LLM evaluations with online experiments, arguing that treating the two as competing methods is the wrong mental model. The right structure, according to senior data scientists Matilda Ankargren and Marten Schultzberg, is a funnel: evals filter before the experiment, and experiment outcomes calibrate the evals afterward.
The starting point in the Spotify Engineering blog post is a number worth sitting with. Only 12 percent of Spotify’s A/B tests end in a shipped positive result. Around 64 percent produce valid learning: a regression caught, an idea ruled out, a hypothesis refined. The win rate understates the value, but it also points to a resource question. If most tests do not ship, anything that raises the proportion of experiments that actually matter is worth the infrastructure cost.
The funnel model separates two tasks that teams often conflate. Evals verify: does the output conform to quality standards? Experiments validate: do real users respond as predicted? The sequencing matters. A strong eval stack means you do not run an experiment to find out whether a change does what you intend. Evals confirm that first. The experiment then answers whether the intended change drives the business outcome it was meant to, and whether secondary metrics stay healthy.
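The filtering half of the funnel can be sketched in a few lines. This is a minimal illustration, not Spotify's implementation: the `Candidate` type, the `eval_pass_rate` field, the 0.9 threshold, and the two-slot budget are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    eval_pass_rate: float  # hypothetical: share of sampled outputs the offline evals accepted


def select_for_experiment(candidates, min_pass_rate=0.9, max_slots=2):
    """Keep candidates that clear the eval bar, best first, up to the slot budget."""
    passing = [c for c in candidates if c.eval_pass_rate >= min_pass_rate]
    passing.sort(key=lambda c: c.eval_pass_rate, reverse=True)
    return passing[:max_slots]


# Made-up candidates; only those that pass verification reach an experiment slot.
candidates = [
    Candidate("prompt_v2", 0.95),
    Candidate("prompt_v3", 0.72),
    Candidate("reranker_a", 0.91),
    Candidate("reranker_b", 0.97),
]
selected = select_for_experiment(candidates)
print([c.name for c in selected])
```

The gate only answers the verification question. Whether the selected candidates move the business metric is still left to the experiment.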
Spotify’s post illustrates the model with an LLM judge built to flag trust-breaking recommendations, cases where the system surfaces content that does not fit the user. That judge does two jobs: it discovers patterns the team did not know to look for, and after a fix ships, it verifies that the flagged violations dropped. What the judge cannot tell you is whether users who received the improved version actually showed better long-term retention. That question requires an experiment.
The feedback loop closes when teams run their LLM evals on the A/B test data itself. Did the version the judge preferred actually perform better with users? When the gap between eval scores and experiment outcomes is large, the post describes that as diagnostic signal. The judge is capturing something, but not the thing that drives value. Each cycle recalibrates the next.
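One simple way to quantify that gap, assuming each past experiment has been reduced to a judge score difference and an online metric difference, is to count how often the two point in the same direction. The record type, field names, and numbers below are hypothetical, and sign agreement is only one possible calibration measure; the post does not prescribe a specific statistic.

```python
from dataclasses import dataclass


@dataclass
class ExperimentRecord:
    name: str
    judge_delta: float   # hypothetical: judge score, treatment minus control
    metric_delta: float  # hypothetical: online metric, treatment minus control


def sign(x):
    """Return 1, 0, or -1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def calibration_report(records):
    """Return the share of experiments where judge and users agreed, plus the misses."""
    disagreements = [
        r.name for r in records if sign(r.judge_delta) != sign(r.metric_delta)
    ]
    agreement = 1 - len(disagreements) / len(records)
    return agreement, disagreements


# Made-up history of four experiments.
records = [
    ExperimentRecord("exp_a", 0.10, 0.8),
    ExperimentRecord("exp_b", 0.05, -0.3),  # judge preferred it, users did not
    ExperimentRecord("exp_c", -0.02, -0.1),
    ExperimentRecord("exp_d", 0.07, 0.4),
]
agreement, disagreements = calibration_report(records)
print(agreement, disagreements)
```

The disagreeing experiments are the ones worth reading by hand, since they are where the judge is measuring something other than what drives value.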
Skepticism about offline evals is well documented in the field. Offline evals are proxies. They substitute a score for the outcome you actually care about, and that substitution is valid only as long as the score tracks the real outcome. The Spotify Engineering post names this directly, citing a concrete example: when Anthropic released Claude Opus, coding evals from Qodo showed no improvement, but the model had improved substantially on longer tasks. A controlled experiment would have surfaced that difference. Miscalibration runs in both directions, and without continuous offline-to-online signal comparison, evals produce opinions rather than evidence.
Spotify also reports that around 42 percent of launched experiments get rolled back to prevent regression in secondary metrics, including session length, crash rates, and retention. No offline eval flagged those regressions. The guardrail function of online experiments stays necessary regardless of how good the eval stack becomes.
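A guardrail check of this kind can be sketched as a comparison of each secondary metric's change against a tolerance. The metric names echo the ones listed above, but the numbers, the tolerances, and the `breached_guardrails` helper are hypothetical. The sketch also compares point estimates only, whereas a production system would more likely test against confidence intervals.

```python
def breached_guardrails(changes, tolerances, lower_is_better=()):
    """Return the guardrail metrics whose harmful movement exceeds its tolerance.

    changes: relative change per metric, treatment versus control.
    tolerances: largest harmful change accepted per metric.
    lower_is_better: metrics where an increase is the harmful direction.
    """
    breached = []
    for metric, change in changes.items():
        # Express every change as "harm" so one comparison covers both directions.
        harm = change if metric in lower_is_better else -change
        if harm > tolerances[metric]:
            breached.append(metric)
    return breached


# Made-up readout: session length fell by 3 percent, past its 1 percent tolerance.
changes = {"session_length": -0.03, "crash_rate": 0.002, "retention": 0.001}
tolerances = {"session_length": 0.01, "crash_rate": 0.005, "retention": 0.01}
breached = breached_guardrails(changes, tolerances, lower_is_better={"crash_rate"})
print(breached)
```

Any non-empty result would be grounds for a rollback review, whatever the offline evals said about the change.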
For ML teams running A/B tests on LLM-powered features, the practical consequence of this framework is a sequencing change and an additional measurement pass. Evals run early, before experiment slots are consumed, to discard unpromising candidates and surface hypotheses. After results come in, teams run the same eval judges against the variant data to check whether the judge’s preference correlates with user outcomes. The post argues the value compounds when the system is simple enough to use and rigorous enough to trust.
Teams currently shipping LLM features without a calibrated offline-to-online feedback loop should treat that gap as measurement debt: the longer it goes unaddressed, the less confident they can be that their eval scores predict anything a user would notice.
Posted on the Spotify Engineering blog on 2026-05-20.