Gray Swan Is Building the Security Layer AI Labs Cannot Build Themselves

The Carnegie Mellon spinout red-teams frontier models for Anthropic and others, and its founders say the first major AI breach is not a question of if.

Alessandro Benigni

PUBLISHED JUN 25, 2026

Gray Swan, the Pittsburgh-based AI security company co-founded by CMU professors Zico Kolter and Matt Fredrikson, occupies a structural gap that is getting harder to ignore: frontier labs need adversarial evaluation of their own systems, and they cannot fully supply it themselves. Anthropic cited Gray Swan as an authority in the Mythos model card, having commissioned the firm to stress-test the model’s resistance to indirect prompt injection before release. That citation is a datapoint worth examining.

The business logic for third-party red-teaming is the same as it has always been in security: an organization auditing its own attack surface faces a conflict of interest. For AI labs there is a technical obstacle as well. When Anthropic or another lab trains a model to refuse harmful outputs, that same safety training also suppresses the model’s willingness to attack other models. Gray Swan’s founders told Latent Space that frontier models make poor automated red teamers precisely because their safety training gets in the way. A specialized red-teaming model, trained on large volumes of adversarial data from skilled human attackers, has no such inhibition. Gray Swan’s system, called Shade, now outperforms human red teamers within fixed time windows on specific task sets, according to Fredrikson. The claim is limited to those conditions, but the direction is notable.

The firm runs two parallel businesses from this position. One is the Gray Swan Arena, a community of roughly 15,000 participants who compete in prize challenges to find vulnerabilities in lab-sponsored models. The other is the automated red-teaming toolchain anchored by Shade. Together they produce the adversarial data that trains Gray Swan’s defensive product: Cygnal, a filter model that sits between a deployed agent and the outside world, checking both inbound content for injections and outbound tool calls for policy violations. The architecture is deliberate. You cannot build a robust guardrail product without a continuous adversarial signal to train against. Gray Swan controls both sides of that loop.
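To make the two-sided check concrete, here is a minimal sketch of a filter that sits between an agent and the outside world, screening inbound content and outbound tool calls. It is not Cygnal's implementation, which the article describes as a trained filter model. The regex patterns, the `Policy` and `ToolCall` types, and the function names are all hypothetical stand-ins chosen for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns standing in for a trained injection classifier.
# A real filter model would learn from adversarial data, not fixed regexes.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"disregard your system prompt", re.IGNORECASE),
]


@dataclass
class ToolCall:
    """A tool invocation the agent wants to make (illustrative shape)."""
    name: str
    args: dict = field(default_factory=dict)


@dataclass
class Policy:
    """Assumed deployment policy: which tools and destinations are permitted."""
    allowed_tools: frozenset
    blocked_domains: frozenset = frozenset()


def inbound_is_suspicious(content: str) -> bool:
    """Return True if inbound content matches any known injection pattern."""
    return any(p.search(content) for p in INJECTION_PATTERNS)


def outbound_is_allowed(call: ToolCall, policy: Policy) -> bool:
    """Return True only if the tool is allowlisted and its URL is not blocked."""
    if call.name not in policy.allowed_tools:
        return False
    url = str(call.args.get("url", ""))
    return not any(domain in url for domain in policy.blocked_domains)
```

In use, an agent loop would call `inbound_is_suspicious` on every retrieved page or tool result before the model sees it, and `outbound_is_allowed` on every tool call before it executes. The point of the sketch is the placement of the two checks, not the quality of the detection.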

Kolter’s point about scale deserves attention from anyone building on agentic systems. General-purpose models do not become more robust to adversarial pressure simply by getting larger. Frontier labs have to train robustness explicitly, and even then the problem is not solved. In Gray Swan’s published benchmark data, the correlation between a model’s performance on standard capability evaluations and its resistance to indirect prompt injection is weak. A highly capable model is not automatically a secure agent. That gap is where Gray Swan’s customers live.

The firm is also moving into adjacent market territory that could prove more durable than red-teaming contracts alone. Kolter described an active working relationship with AI underwriting companies that need arm’s-length risk assessment before pricing a policy. The parallel to cyber insurance is structural: insurers need a credible third party to measure risk, and then a credible third party to prescribe and verify mitigations. Gray Swan occupies both roles. Fredrikson noted there is not yet a universally accepted compliance framework for AI deployments, the way SOC 2 functions for cloud infrastructure. Building toward that framework, or becoming the preferred auditor when one arrives, is the longer strategic play.

The most direct consequence for teams shipping agentic systems comes from what Gray Swan found while red-teaming computer-use agents. The attack surface on a browser-controlling agent is substantially wider than on a chat-based model. Gray Swan ran a controlled experiment pitting human participants against browser agents, with red teamers choosing whether to phish the humans or inject the agents. Skilled human attackers achieved 60 to 70 percent success rates against human participants. Some agents were highly vulnerable; a small number showed surprising resilience. What held across all conditions is that agents and humans fail for entirely different reasons, and the attack techniques transfer poorly between the two targets.

Any team currently deploying Codex, Claude Code, or a similar coding agent against production environments should treat that finding as a specification for what a security evaluation needs to cover. A model that passes a content-policy evaluation in a chat context is not thereby validated for agentic use. The evaluation has to be designed for the attack surface the agent actually presents.

Gray Swan’s founders expect the market for independent AI security evaluation to grow sharply as enterprises adopt agentic tooling at scale. The first major public prompt-injection breach, Fredrikson said, will likely accelerate that adoption the way a high-profile cyber incident accelerates demand for security auditing. Teams that commission independent red-teaming before that event will have a structural advantage over those that wait.

Reported by Latent Space in a podcast episode published June 18, 2026, featuring Gray Swan co-founders Zico Kolter and Matt Fredrikson.
