Production AI teams can replay a failed agent run, but they cannot yet answer the question that follows: what pattern of failures is actually happening across the last hundred thousand runs? Braintrust, the AI evaluation platform, shipped a feature called Topics on June 4 to close that gap, and the architecture behind it is more interesting than the product announcement suggests.
Ankur Goyal, Braintrust’s CEO, described the problem plainly on X. A single production agent trace can run to a million tokens and contain hundreds of spans covering tool calls, sub-agent invocations, and retrieved-context blobs. The documents that standard NLP clustering tools were built to handle are short, uniform, and semantically consistent. Agent traces are none of those things. Each span is a different artifact type with different semantics, which means a naive embedding of the full trace produces noise, not signal.
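To make the scale mismatch concrete, here is a back-of-the-envelope calculation. The trace size is the upper-end figure cited above; the 8,192-token embedding input limit is an assumed, illustrative number, not one taken from Braintrust or any specific model.

```python
import math

TRACE_TOKENS = 1_000_000  # upper-end trace size cited above
EMBED_WINDOW = 8_192      # ASSUMED embedding input limit, for illustration only

# How many separate embedding calls a naive chunking of one trace would need.
windows_needed = math.ceil(TRACE_TOKENS / EMBED_WINDOW)

# Share of the trace a single embedding call could see at once.
coverage_pct = 100 * EMBED_WINDOW / TRACE_TOKENS

print(f"{windows_needed} windows, {coverage_pct:.2f}% of the trace per window")
```

Under that assumption, each embedding would see less than one percent of the trace, and the chunk boundaries would cut across tool calls, sub-agent invocations, and retrieved context without regard to span type.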
The Topics pipeline runs in six stages: preprocess, facet, embed, cluster, name, classify. Preprocessing normalizes each span by type. The facet step splits each trace into operation-level units. Cheap embedding models then run on each facet individually. The clusters that form across many traces get named by an LLM, and those names become the labels used to classify new traces as they arrive. The key architectural decision is where the LLM first enters the pipeline: ahead of embedding, not only at the naming step that follows clustering. Because an LLM compresses each span into a short summary before anything reaches the embedding model, the raw trace never needs to fit inside the embedding context window. That single design choice is what makes the pipeline tractable at production scale.
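The six stages can be sketched end to end in a few dozen lines. This is a toy illustration under stated assumptions, not Braintrust's implementation: every function name and the span schema are hypothetical, `facet` truncates text where the real pipeline would call an LLM to summarize, `embed` uses bag-of-words counts in place of an embedding model, and `name` picks the most common facet where the real pipeline would ask an LLM for a label.

```python
import math
from collections import Counter

# Hypothetical traces: each is a list of spans with type-specific fields.
TRACES = [
    [{"type": "tool_call", "name": "search", "error": "timeout"},
     {"type": "llm", "output": "Sorry, the search tool timed out."}],
    [{"type": "tool_call", "name": "search", "error": "timeout"}],
    [{"type": "tool_call", "name": "sql", "error": "syntax error"}],
]

def preprocess(span):
    """Stage 1: normalize a span into flat text according to its type."""
    if span["type"] == "tool_call":
        return f"tool {span['name']} error {span.get('error', 'none')}"
    return f"llm {span.get('output', '')}"

def facet(text, max_words=6):
    """Stage 2 stand-in: bound each unit to a short text (an LLM summary in practice)."""
    return " ".join(text.lower().split()[:max_words])

def embed(text):
    """Stage 3 stand-in: bag-of-words counts instead of a real embedding model."""
    return Counter(text.replace(",", " ").replace(".", " ").split())

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.7):
    """Stage 4: greedy clustering against the first member of each cluster."""
    clusters = []
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def name(facets):
    """Stage 5 stand-in: most common facet text (an LLM-written label in practice)."""
    return Counter(facets).most_common(1)[0][0]

def build_topics(traces):
    """Run stages 1-5 and return {topic name: summed cluster vector}."""
    facets = [facet(preprocess(span)) for trace in traces for span in trace]
    vectors = [embed(f) for f in facets]
    topics = {}
    for members in cluster(vectors):
        centroid = Counter()
        for i in members:
            centroid.update(vectors[i])
        topics[name([facets[i] for i in members])] = centroid
    return topics

def classify(span, topics):
    """Stage 6: label a new span with the most similar named topic."""
    vec = embed(facet(preprocess(span)))
    return max(topics, key=lambda label: cosine(topics[label], vec))

topics = build_topics(TRACES)
new_span = {"type": "tool_call", "name": "search", "error": "connection timeout"}
label = classify(new_span, topics)
```

The point of the sketch is the ordering: `embed` only ever receives the short output of `facet`, so the size of the original trace never constrains the embedding step.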
The pipeline's connection to Clio is worth noting because it is a credibility signal, not a marketing claim. Anthropic’s Clio paper, published by the lab’s internal research team, described a similar approach for analyzing Claude conversation patterns at scale. Braintrust has adapted that architecture for production agent observability. Anthropic applied the insight to a different problem; Braintrust is shipping it as infrastructure anyone can buy.
Topics fits inside Braintrust’s existing suite alongside Loop (for evals) and Brainstore (for vector queries). The positioning is deliberate: Topics surfaces what the patterns are; Loop lets teams act on them. Each tool is far less useful without the other, which is what an observability product looks like when it is built around a workflow rather than a data type.
The deeper observation connects to a thread running through several infrastructure announcements this week. Memory architecture conversations around Sentra, Mem0, and OpenAI’s Dreaming work have all circled the same underlying problem: production AI systems now generate data at a scale where the next bottleneck is not raw capability but what you can see, measure, and consolidate from the output. Topics is what observability looks like once you accept that LLM summarization is the only viable preprocessing step. The framing that embedding models can ingest raw agent traces directly, if you just increase their context window enough, does not survive contact with a real production system.
Goyal shared the thread without disclosing adoption numbers or which customer verticals are using Topics in production. The release announcement does not include independent benchmarks on clustering accuracy or latency at scale. Those are the metrics that would distinguish Topics from a prototype pipeline a team could assemble from open-source components in a weekend.
For teams currently deploying multi-step agents and relying on per-trace logging to understand production behavior, Goyal's architectural argument is worth testing against their own trace corpus before the pattern-identification problem compounds further.
Source: Ankur Goyal (Braintrust CEO) on X, posted June 4, 2026.