OpenAI published a macro-evaluation workflow for multi-agent systems in its Developer Cookbook on May 22, describing a methodology that shifts the unit of analysis from a single failed run to patterns observed across an entire trace population. The framing is direct: most teams look at individual failures in isolation, which means they miss systemic issues that only surface when you analyze the full distribution.

The workflow structures the problem into two levels. Lower-level evals grade individual agents, handoffs, tool calls, and completed runs. The cookbook uses Promptfoo for this layer, scoring each run across dimensions that include final decision quality, policy correctness, specialist routing, market drift, and whether a review step fired when it should have. The macro-eval layer then aggregates those lower-level findings across hundreds or thousands of traces and asks a different set of questions: which problems repeat, where they concentrate, and which part of the agent workflow should receive human attention first.

Four labels organize the analysis: case_type (the input scenario), run_outcome (how the run ended), eval_finding (the local signal from lower-level evals), and behavior_pattern (the population-level cluster). The notebook compresses each trace into a compact document, then runs unsupervised clustering over those documents to surface recurring patterns. The output is not a comprehensive taxonomy of every trace. It is a short list of high-impact patterns, ranked by frequency and severity, that a technical or business stakeholder can act on.
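As a rough illustration of the four-label structure, the sketch below represents each trace as a small record, compresses it into a one-line document, and ranks behavior patterns by frequency weighted by severity. The field names follow the labels above. The `severity` scale, the sample records, and the scoring rule (count times mean severity) are assumptions made for illustration, not the cookbook's schema or ranking method. The clustering step itself is left out: `behavior_pattern` is assumed to have been assigned already.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceSummary:
    case_type: str         # the input scenario
    run_outcome: str       # how the run ended
    eval_finding: str      # local signal from lower-level evals
    behavior_pattern: str  # population-level cluster (assumed pre-assigned)
    severity: int          # hypothetical 1-5 scale, not from the cookbook


def to_document(t: TraceSummary) -> str:
    """Compress a trace summary into a compact one-line document."""
    return (
        f"case_type={t.case_type}; run_outcome={t.run_outcome}; "
        f"eval_finding={t.eval_finding}"
    )


def rank_patterns(traces):
    """Rank behavior patterns by count * mean severity (an assumed rule)."""
    grouped = defaultdict(list)
    for t in traces:
        grouped[t.behavior_pattern].append(t.severity)
    scored = [
        (pattern, len(sevs), sum(sevs) / len(sevs))
        for pattern, sevs in grouped.items()
    ]
    return sorted(scored, key=lambda row: row[1] * row[2], reverse=True)


# Illustrative records only; not data from the cookbook.
TRACES = [
    TraceSummary("discount_request", "completed", "incentive_ignored",
                 "ignores_incentive_signal", 3),
    TraceSummary("discount_request", "completed", "incentive_ignored",
                 "ignores_incentive_signal", 3),
    TraceSummary("bulk_order", "completed", "incentive_ignored",
                 "ignores_incentive_signal", 2),
    TraceSummary("bulk_order", "escalated", "review_not_triggered",
                 "skips_review_step", 5),
]

if __name__ == "__main__":
    for pattern, count, mean_sev in rank_patterns(TRACES):
        print(f"{pattern}: n={count}, mean severity={mean_sev:.2f}")
```

In this toy data the frequent, moderate-severity pattern outranks the single severe one, which is the kind of trade-off a ranking rule has to make explicit. A different weighting would reorder the list.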

The analytical gap this fills is real. Individual trace review is the default because it is operationally easy: a run fails, someone opens the log, finds the proximate cause, and writes a fix. The problem is that this approach optimizes for the loudest failure rather than the most frequent one. A pricing agent that ignores an incentive signal on 12 percent of runs may never produce a catastrophic single failure, but it silently degrades outcomes at scale. That kind of pattern is invisible without population-level analysis.

The Spotify funnel framework for measuring agent effectiveness, which we covered last week, connects directly to this workflow. Spotify’s approach segments agent runs by input type to find where conversion drops. OpenAI’s macro-eval workflow provides the failure-taxonomy layer that sits underneath that kind of funnel view. The two approaches are complementary: the funnel tells you where outcomes differ; the macro-eval tells you why.

One constraint worth naming: the cookbook is built on OpenAI’s own infrastructure. The trace format assumes OpenAI’s Responses API, the eval labels come from Promptfoo configured against OpenAI outputs, and the clustering layer expects the document schema the cookbook defines. Teams running agents on Anthropic models, open-weight stacks, or their own tracing infrastructure will need to adapt the data normalization step before the rest of the pipeline applies. The conceptual framework is portable. The implementation is not plug-and-play across every stack.
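For teams on other stacks, the adaptation work is mostly a mapping exercise. The sketch below shows one way such a normalization step might look: a per-source adapter that maps a raw trace dictionary into a common record. Every key name here (`spans`, `kind`, `status`, `events`, `type`, and the normalized fields) is hypothetical and stands in for whatever a given tracing system actually emits. None of it reflects the cookbook's document schema or any vendor's real trace format.

```python
from typing import Any, Callable, Dict

NormalizedTrace = Dict[str, Any]


def from_span_tracer(raw: Dict[str, Any]) -> NormalizedTrace:
    """Adapter for a hypothetical span-based tracing format."""
    spans = raw.get("spans", [])
    return {
        "trace_id": raw["id"],
        "tool_calls": [s["name"] for s in spans if s.get("kind") == "tool"],
        "handoffs": [s["name"] for s in spans if s.get("kind") == "handoff"],
        "run_outcome": raw.get("status", "unknown"),
    }


def from_event_log(raw: Dict[str, Any]) -> NormalizedTrace:
    """Adapter for a hypothetical flat event-log format."""
    events = raw.get("events", [])
    return {
        "trace_id": raw["run_id"],
        "tool_calls": [e["tool"] for e in events if e.get("type") == "tool_call"],
        "handoffs": [e["to_agent"] for e in events if e.get("type") == "handoff"],
        "run_outcome": raw.get("final_state", "unknown"),
    }


# Registry of adapters, keyed by an illustrative source name.
ADAPTERS: Dict[str, Callable[[Dict[str, Any]], NormalizedTrace]] = {
    "span_tracer": from_span_tracer,
    "event_log": from_event_log,
}


def normalize(source: str, raw: Dict[str, Any]) -> NormalizedTrace:
    """Dispatch to the adapter registered for this trace source."""
    try:
        adapter = ADAPTERS[source]
    except KeyError:
        raise ValueError(f"no adapter registered for source {source!r}") from None
    return adapter(raw)


if __name__ == "__main__":
    example = {
        "id": "t1",
        "status": "completed",
        "spans": [
            {"kind": "tool", "name": "price_lookup"},
            {"kind": "handoff", "name": "pricing_specialist"},
        ],
    }
    print(normalize("span_tracer", example))
```

Keeping one adapter per source means the downstream eval and clustering code only ever sees the normalized shape, so supporting a new stack is a matter of adding one function.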

The cookbook does not include independent benchmarks showing how much this workflow improves system quality in production. The methodology is described clearly and the notebook is runnable with precomputed synthetic data, but the evidence base is illustrative rather than validated on real production workloads. That distinction matters for teams deciding how much engineering time to invest in rebuilding their eval cadence.

Teams running production agent pipelines should treat the next 30 to 60 days as the window to audit whether their current eval setup produces population-level views or only per-run diagnostics. Building even a basic aggregation layer over existing trace data, one that ranks eval_finding labels by frequency rather than recency, is likely to surface at least one systemic pattern that per-run review has not caught.
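A basic aggregation layer of this kind can be quite small. The sketch below counts `eval_finding` labels across a set of runs and reports each label's share of the population, so the ordering is driven by frequency rather than by which failure happened most recently. The input shape (a list of dictionaries with an `eval_finding` key), the `"pass"` label for clean runs, and the sample values are assumptions for illustration. This is plain counting, not the clustering the cookbook performs.

```python
from collections import Counter


def finding_frequencies(runs, ignore=("pass",)):
    """Return (label, count, share_of_all_runs), sorted by count descending."""
    total = len(runs)
    if total == 0:
        return []
    counts = Counter(
        r["eval_finding"] for r in runs if r["eval_finding"] not in ignore
    )
    return [(label, n, n / total) for label, n in counts.most_common()]


# Illustrative data: 100 runs, listed oldest to newest.
RUNS = (
    [{"eval_finding": "pass"}] * 85
    + [{"eval_finding": "incentive_ignored"}] * 12
    + [{"eval_finding": "wrong_specialist"}] * 2
    + [{"eval_finding": "review_not_triggered"}] * 1  # the most recent failure
)

if __name__ == "__main__":
    for label, n, share in finding_frequencies(RUNS):
        print(f"{label}: {n} runs ({share:.0%})")
```

In this made-up sample the most recent failure is the rarest one, while the label affecting 12 of 100 runs sits at the top of the list. That is the reordering a recency-driven review tends to miss.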

Published in the OpenAI Developer Cookbook on 2026-05-22.