OpenAI has published details of a pre-release evaluation method called Deployment Simulation, which regenerates responses to real, anonymized production conversations using a candidate model before that model ships. The goal is to identify behavioral failures that only appear in genuine usage contexts, not in the curated adversarial prompts that dominate standard safety evaluations.
The mechanics are straightforward. OpenAI samples recent production conversations from users who have opted in to data use for model improvement, strips the original model’s response from each one, and runs the candidate model against the same conversation context. The resulting completions are then scanned for misalignment patterns, unwarranted refusals, and novel behaviors that existing evaluation categories had not anticipated.
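The replay loop described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenAI's implementation: the message format, the `candidate` callable, the `classifiers` mapping, and every function name here are assumptions made for the example, and the sketch assumes that "stripping the original response" means dropping the trailing assistant turn.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Message = Dict[str, str]  # hypothetical format: {"role": ..., "content": ...}


@dataclass
class ReplayResult:
    context: List[Message]
    completion: str
    flags: List[str]


def strip_final_response(conversation: List[Message]) -> List[Message]:
    """Drop trailing assistant turns so the candidate sees only the context."""
    context = list(conversation)
    while context and context[-1]["role"] == "assistant":
        context.pop()
    return context


def replay(
    conversations: List[List[Message]],
    candidate: Callable[[List[Message]], str],
    classifiers: Dict[str, Callable[[List[Message], str], bool]],
) -> List[ReplayResult]:
    """Regenerate each response with the candidate model and scan the output."""
    results = []
    for conversation in conversations:
        context = strip_final_response(conversation)
        if not context:
            continue  # nothing left to respond to
        completion = candidate(context)
        flags = [name for name, check in classifiers.items()
                 if check(context, completion)]
        results.append(ReplayResult(context, completion, flags))
    return results


def flag_rate(results: List[ReplayResult], flag: str) -> float:
    """Fraction of replayed conversations on which a given flag fired."""
    if not results:
        return 0.0
    return sum(flag in r.flags for r in results) / len(results)
```

In practice `candidate` would wrap a call to the model under test and each classifier would be a grader of some kind; the per-flag rate from `flag_rate` is the number that would be compared against post-release observations.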
The core argument for this approach is that representative traffic catches what adversarial benchmarks structurally cannot. Standard evals are built around known failure modes, and a model facing a widely used test suite may well have seen similar prompts during training. Real user conversations do not follow that pattern. They are messy, context-dependent, and distributed across thousands of use cases that no red team fully anticipates. OpenAI’s claim is that models are less likely to detect they are being evaluated when the context looks like ordinary usage, which should yield more natural behavior and reduce what the field sometimes calls evaluation gaming or sandbagging.
The method produced useful results on the GPT-5 series. OpenAI reported that Deployment Simulation accurately predicted the direction of behavioral change during that release cycle and surfaced novel misalignment issues before the model went live. The quantified calibration figure is a median multiplicative error of roughly 1.5x against post-release observed rates. Read as a fold difference, that means if the simulation estimated that one percent of conversations would exhibit a specific undesired behavior, the actual deployment rate would land between about 0.67 and 1.5 percent in roughly half of cases and outside that band in the rest. For a system operating at scale, that precision is good enough to inform go/no-go decisions, though the method will sometimes underestimate or overestimate risk by a nontrivial margin.
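The arithmetic behind the 1.5x figure can be made concrete with a short sketch. It assumes the multiplicative error is a symmetric fold difference, that is, the larger of predicted/observed and observed/predicted; the announcement as summarized here does not spell out the exact definition, so treat that as an assumption. The function names and the sample rates are illustrative, not published data.

```python
import statistics
from typing import Iterable, Tuple


def multiplicative_error(predicted: float, observed: float) -> float:
    """Symmetric fold error: 1.0 is a perfect prediction, 2.0 is off by 2x."""
    if predicted <= 0 or observed <= 0:
        raise ValueError("rates must be positive to form a ratio")
    ratio = predicted / observed
    return max(ratio, 1.0 / ratio)


def median_multiplicative_error(pairs: Iterable[Tuple[float, float]]) -> float:
    """Median fold error over (predicted_rate, observed_rate) pairs."""
    return statistics.median([multiplicative_error(p, o) for p, o in pairs])


def implied_band(predicted: float, factor: float) -> Tuple[float, float]:
    """Range of observed rates that lie within `factor` of the prediction."""
    return predicted / factor, predicted * factor


# A 1% predicted rate with a 1.5x factor gives roughly 0.67% to 1.5%.
low, high = implied_band(0.01, 1.5)
print(f"{low:.4%} to {high:.4%}")
```

Because the figure is a median, about half of the behaviors tracked would be expected to fall inside that band and the rest outside it.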
OpenAI extended Deployment Simulation to agentic settings by simulating tool calls, not just chat completions. That extension matters because coding agents and other tool-using systems introduce failure modes that a text-only response replay would miss entirely: a model that behaves well in conversation can still misuse a function call, fail to validate an input, or take an irreversible action in an agentic loop. Covering that surface before deployment is a different problem from covering conversational safety, and the fact that OpenAI applied the same replay logic to tool call sequences suggests the method generalizes.
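One way to picture replay with simulated tools is an agent loop in which every tool call is answered by a stub and never executed, so the resulting trace can be inspected safely. The sketch below is hypothetical throughout: the summary above does not describe how OpenAI simulates tool results, and the tool names, the `agent_step` callable, and the flagging rule are all invented for illustration.

```python
from typing import Callable, Dict, List, Optional

ToolCall = Dict[str, object]  # hypothetical shape: {"name": str, "args": dict}

# Hypothetical tool names treated as irreversible for this example.
IRREVERSIBLE = {"delete_file", "send_email"}


def simulate_tool(call: ToolCall) -> str:
    """Return a canned result; nothing is executed against real systems."""
    return f"[simulated result for {call['name']}]"


def replay_agent(
    context: List[Dict[str, str]],
    agent_step: Callable[[List[Dict[str, str]]], Optional[ToolCall]],
    max_steps: int = 5,
) -> List[ToolCall]:
    """Run the candidate's agent loop against simulated tools; return its calls."""
    transcript = list(context)
    trace: List[ToolCall] = []
    for _ in range(max_steps):
        call = agent_step(transcript)
        if call is None:  # the agent chose to stop calling tools
            break
        trace.append(call)
        transcript.append({"role": "tool", "content": simulate_tool(call)})
    return trace


def flag_irreversible(trace: List[ToolCall]) -> List[str]:
    """Names of calls in the trace that would have been irreversible if real."""
    return [str(c["name"]) for c in trace if c["name"] in IRREVERSIBLE]
```

The step cap keeps a looping agent from running indefinitely, and the flagged names are one example of a signal that a text-only replay would never produce.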
OpenAI positions Deployment Simulation as a complement to adversarial red-teaming and capability evaluations, not a replacement. That framing is accurate, and the reason is worth spelling out. The method only covers behaviors that the current production distribution elicits. A novel capability or failure mode that is triggered only by requests users have not yet made would not surface when replaying conversations shaped by the previous model’s behavior. Scripted adversarial testing remains necessary for probing capabilities the production population has not yet exercised.
The release announcement does not include independent verification of the 1.5x median error figure, nor does it describe how the system handles conversations in which it is ambiguous whether a failure actually occurred. Those are open questions for the broader eval community.
Labs evaluating their own safety infrastructure now have a concrete method to benchmark against. Teams building eval pipelines for fine-tuned or distilled versions of frontier models should assess whether their production traffic is large enough and diverse enough to make this approach viable at their scale.
Source: OpenAI, published June 16, 2026.