Anthropic argues containment starts at the environment layer, not the model

A new engineering post details how Anthropic designs agent isolation across Claude Code, claude.ai, and Cowork, and why model-layer defenses alone cannot hold.

Alessandro Benigni

PUBLISHED MAY 28, 2026


Anthropic published a detailed engineering post on May 26 laying out its containment architecture for agentic products, and the central argument is a direct challenge to how most teams think about AI safety: lock down the environment first, then worry about steering model behavior.

The post frames the deployment calculus in terms of blast radius. As agents grow more capable, the damage from a failure grows with them. Model training and safety work can reduce the probability of a failure, but Anthropic argues that capping the maximum damage requires architectural constraints at the system level: sandboxes, virtual machines, filesystem boundaries, and egress controls. A credential that never enters a sandbox cannot be exfiltrated, regardless of whether the failure comes from a user mistake, a model finding a creative path to a goal, or an external attacker.
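The credential point can be illustrated with a minimal Python sketch. It assumes the agent's commands run as child processes and that secrets live in environment variables; the marker list and the helper names `scrubbed_env` and `run_tool` are hypothetical and do not come from Anthropic's post.

```python
import os
import subprocess

# Hypothetical markers for secret-bearing variable names. A real deployment
# would more likely pass an explicit allowlist than rely on a denylist.
SECRET_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD", "CREDENTIAL")


def scrubbed_env(env):
    """Return a copy of env without variables whose names look like secrets."""
    return {
        name: value
        for name, value in env.items()
        if not any(marker in name.upper() for marker in SECRET_MARKERS)
    }


def run_tool(argv, env=None):
    """Run a tool with the scrubbed environment and capture its output."""
    source = os.environ if env is None else env
    return subprocess.run(
        argv, env=scrubbed_env(source), capture_output=True, text=True, check=False
    )
```

Whatever the child process does, it cannot leak a variable it never received. The name matching is deliberately over-broad and will also drop harmless variables that happen to contain a marker.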

Anthropic describes three isolation patterns across its three primary agent products. For claude.ai’s code execution, it uses ephemeral gVisor containers that are fully server-side, per-session, and isolated from user machines. For Claude Code, which runs on user hardware with full filesystem and shell access, it ships an OS-level sandbox using Seatbelt on macOS and bubblewrap on Linux, allowing reads and workspace writes while blocking network access by default. Anthropic reports that the sandbox cut permission prompts by 84 percent. For Claude Cowork, built for non-technical knowledge workers, the containment is a full local virtual machine, because the human-in-the-loop model breaks down when users cannot read bash.
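To make the Linux pattern concrete (reads allowed, writes confined to the workspace, no network), the sketch below builds a bubblewrap command line. The flag selection is an assumption about what such a policy could look like, not Anthropic's actual configuration, and `sandbox_argv` is a hypothetical helper.

```python
import shlex


def sandbox_argv(workspace, command):
    """Build a bwrap argv: read-only root, writable workspace, no network."""
    return [
        "bwrap",
        "--ro-bind", "/", "/",            # whole filesystem visible, read-only
        "--dev", "/dev",                  # minimal device nodes
        "--proc", "/proc",                # fresh procfs
        "--tmpfs", "/tmp",                # private scratch space
        "--bind", workspace, workspace,   # workspace stays writable
        "--unshare-net",                  # new network namespace: no egress
        "--die-with-parent",              # sandbox exits with the supervisor
        "--chdir", workspace,
        *command,
    ]


# Print the command line instead of executing it, so the sketch runs anywhere.
print(shlex.join(sandbox_argv("/home/user/project", ["python3", "build.py"])))
```

The workspace bind comes after the read-only root and the tmpfs so that later mounts take precedence. This sketch also exposes the whole filesystem for reading, which would need narrowing before credentials stored on disk are out of reach.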

The post is honest about where the human-in-the-loop approach failed. Anthropic’s telemetry showed users approved roughly 93 percent of permission prompts, and approval rates went up with experience. More experienced users approved faster and supervised less. The pattern was approval fatigue masquerading as oversight. The fix was to push more of the safety work into the environment layer rather than ask users to make good decisions at every step.

Model-layer defenses are acknowledged and cited. On Gray Swan’s Agent Red Teaming benchmark, Claude Opus 4.7 holds prompt injection attack success to around 0.1 percent on single attempts. Claude Code’s auto mode catches roughly 83 percent of overeager behaviors before execution. But the post is explicit that probabilistic defenses have a nonzero miss rate and cannot stand alone.
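The nonzero-miss-rate point can be made concrete with simple arithmetic. The sketch assumes, purely for illustration, that attempts are independent and that the per-attempt success rate stays at the cited 0.1 percent; real attackers adapt, so the benchmark figure may not carry over to repeated attempts.

```python
def p_any_success(p_single, attempts):
    """Probability of at least one success in n independent attempts."""
    return 1.0 - (1.0 - p_single) ** attempts


for n in (1, 100, 1_000, 10_000):
    print(f"{n:>6} attempts: {p_any_success(0.001, n):.1%}")
```

Under those assumptions, a 0.1 percent single-attempt rate compounds to about 63 percent after 1,000 attempts, which is why a probabilistic defense lowers the odds of a failure but does not cap the damage.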

Anthropic also describes two specific failures. A class of pre-trust vulnerabilities in Claude Code allowed malicious code to execute before the user had consented to a folder, because project-local settings were parsed at startup before the trust dialog appeared. Separately, a controlled red-team exercise in February 2026 demonstrated that a phishing email containing a ready-to-paste prompt could direct Claude Code to read AWS credentials and POST them to an external endpoint. Across 25 attempts, the exfiltration succeeded 24 times. The post notes that model-layer defenses cannot catch this class of attack because the instruction arrives through the user, not through tool output. Egress controls are the only layer that reliably stops it.
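A minimal sketch of the egress-control idea is a default-deny policy check that a proxy or network hook could apply to every outbound request. The allowlisted hosts and the function name `egress_allowed` are hypothetical, and real enforcement would sit at the network layer (a filtering proxy or firewall) and not inside the agent process.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; anything not listed is denied by default.
ALLOWED_HOSTS = frozenset({"pypi.org", "files.pythonhosted.org", "github.com"})


def egress_allowed(url, allowed_hosts=ALLOWED_HOSTS):
    """Allow only HTTPS requests to exactly matching allowlisted hosts."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    return parts.scheme == "https" and host in allowed_hosts
```

Because the check looks only at the destination, it gives the same answer whether the instruction to POST credentials came from tool output or was pasted in by the user.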

This framing serves Anthropic’s commercial interest as clearly as it serves engineering clarity. A philosophy that emphasizes environment-layer containment over alignment tuning is also a philosophy that lowers the barrier to deploying more agents more broadly. It shifts accountability toward whoever controls the runtime, which in enterprise deployments is usually the customer. That is a convenient liability frame for a company selling agent infrastructure.

The contrast with competitors is worth noting. Google’s Agent Executor architecture, covered here on May 21, is primarily an orchestration and execution-policy layer rather than a hardware isolation framework. OpenAI’s computer-use implementation relies heavily on model-layer refusals and user confirmation flows. Cursor’s cloud agent post from May 22 focused on task scoping as the primary safety mechanism. Anthropic’s post is the most detailed public treatment of OS-level primitives applied to agent containment from any of the frontier labs.

Teams deploying their own agents should audit the environment before touching the system prompt. Specifically: does your agent runtime have egress controls that block unexpected outbound connections? Are credentials excluded from the filesystem the agent can reach? Are project-local configs and startup hooks treated as untrusted inputs? If the answer to any of those is no, model training and prompt engineering are not a substitute.
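The third question can be sketched as a trust gate: project-local settings are not read at all until the user has approved the folder. The path `.agent/settings.json`, the `trusted_dirs` set, and `load_project_settings` are hypothetical names for illustration, not Claude Code's actual layout.

```python
import json
from pathlib import Path


def load_project_settings(project_dir, trusted_dirs):
    """Return project-local settings only for folders the user has trusted."""
    root = Path(project_dir).resolve()
    if root not in trusted_dirs:
        # Untrusted folder: do not read or parse anything it controls.
        return {}
    settings_file = root / ".agent" / "settings.json"
    if not settings_file.is_file():
        return {}
    return json.loads(settings_file.read_text(encoding="utf-8"))
```

Here `trusted_dirs` is expected to hold resolved `Path` objects recorded when the user accepts the trust dialog, so the order of operations is trust first, parse second.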

Published on Anthropic’s engineering blog on 2026-05-26.
