OpenAI published a guide on May 29 called “A shared playbook for trustworthy third-party evaluations,” outlining how external researchers and auditors should structure their assessment reports of frontier AI models. The document shipped alongside OpenAI’s Frontier Governance Framework, which we covered on May 30, but the two artifacts are distinct: the governance document addresses OpenAI’s internal risk policies; the playbook is aimed at third-party evaluators, academic labs, and regulators conducting their own assessments.

The playbook prescribes four requirements for any evaluation report that claims to measure real-world capability. First, state the claim precisely: what broader conclusion is this evaluation designed to support? Second, describe representativeness: how closely do the tested conditions reflect that claim? Third, document the harness in full detail, including which tools were available to the model, whether state was preserved across steps, and how errors and retries were handled. Fourth, provide supporting evidence sufficient for a reader to verify how the result was produced and how it would generalize beyond the tested conditions.
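To make the four-part structure concrete, here is a minimal sketch of how a team might encode it as a completeness check on its own reports. The class names, field names, and the `missing_sections` helper are all hypothetical; the playbook describes the requirements in prose and, as far as this article is aware, does not define a schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class HarnessDescription:
    """Harness details a report should document (hypothetical field names)."""
    tools_available: Optional[List[str]] = None          # None means "not documented"
    state_preserved_across_steps: Optional[bool] = None  # None means "not documented"
    error_and_retry_handling: str = ""


@dataclass
class EvaluationReport:
    """One field per requirement described in the playbook."""
    claim: str = ""
    representativeness: str = ""
    harness: HarnessDescription = field(default_factory=HarnessDescription)
    supporting_evidence: List[str] = field(default_factory=list)


def missing_sections(report: EvaluationReport) -> List[str]:
    """Return the names of the requirements the report leaves undocumented."""
    missing = []
    if not report.claim.strip():
        missing.append("claim")
    if not report.representativeness.strip():
        missing.append("representativeness")
    h = report.harness
    harness_complete = (
        h.tools_available is not None
        and h.state_preserved_across_steps is not None
        and bool(h.error_and_retry_handling.strip())
    )
    if not harness_complete:
        missing.append("harness")
    if not report.supporting_evidence:
        missing.append("supporting_evidence")
    return missing


if __name__ == "__main__":
    draft = EvaluationReport(claim="Model X can complete multi-step coding tasks")
    print(missing_sections(draft))
```

A check like this only confirms that each section is present, not that its content is adequate; judging whether the tested conditions are actually representative still requires a human reviewer.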

The harness-dependence section is the substantive contribution. A harness that preserves state across steps and retries failed actions can allow a model to complete a multi-step task that the same model would fail to complete in a simpler harness. When two evaluations use different harness configurations, a score difference between them can reflect the measurement setup as much as it reflects model capability. Most third-party evaluations published in 2025 underdocumented harness choices, which made cross-lab comparisons unreliable. Naming this requirement explicitly, and framing it as a community standard, is more useful than anything that appeared in the broader governance document.
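A toy calculation shows how large the harness effect can be. The sketch below assumes, purely for illustration, that every step of a task succeeds independently with the same probability on each attempt and that a harness retry simply grants another attempt at the failed step. Neither assumption comes from the playbook, real tasks are messier, and the function names are hypothetical.

```python
def step_success_rate(p_attempt: float, retries: int) -> float:
    """Chance that one step succeeds within 1 + retries independent attempts."""
    return 1.0 - (1.0 - p_attempt) ** (retries + 1)


def task_success_rate(p_attempt: float, steps: int, retries: int) -> float:
    """Chance that every step of a multi-step task succeeds (toy model)."""
    return step_success_rate(p_attempt, retries) ** steps


if __name__ == "__main__":
    # Same hypothetical model (80% per-attempt success), same 5-step task,
    # two different harnesses.
    no_retry = task_success_rate(0.8, steps=5, retries=0)
    two_retries = task_success_rate(0.8, steps=5, retries=2)
    print(f"no retries:  {no_retry:.2f}")
    print(f"two retries: {two_retries:.2f}")
```

Under these assumptions the same model completes the task roughly 33 percent of the time with no retries and roughly 96 percent of the time with two retries per step. A gap that size between two published scores could be read as a capability difference if the retry policy were left undocumented.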

The case for skepticism is worth stating plainly. OpenAI is one of the frontier labs most commonly evaluated by third parties. With this document, it is also positioning itself to define what counts as a well-formed evaluation report. A lab that sets the reporting standard for its own external audits has an obvious interest in how rigorous or flexible that standard is. The playbook does not appear to have been developed through a multi-stakeholder process with regulators or academic evaluation groups; it is presented as a company recommendation rather than a negotiated standard. That matters when evaluators face pressure to produce credible work within a framework the evaluated party authored.

For policy and trust-and-safety teams at AI-deploying enterprises, the practical read is this: regulators designing AI audit requirements are likely to treat published lab frameworks as an input, if not a reference. An evaluation report that documents harness choices, representativeness, and supporting evidence in the structure OpenAI describes will be harder to dismiss than one that does not. Reading the playbook is useful not because it is neutral but because it predicts which report formats will carry weight in upcoming compliance contexts.

Teams preparing for mandatory third-party evaluations under emerging AI governance regimes should benchmark their current evaluation documentation against OpenAI’s four-part structure before the next reporting cycle.

Source: OpenAI blog, “A shared playbook for trustworthy third-party evaluations,” published May 29, 2026.