Between 57 and 71 percent of the agent harnesses Mem0 surveyed leaked memory entries from one user session into another user’s context under specific conditions. That figure, published by the memory-infrastructure company on June 2, does not describe a performance regression or a latency problem. It describes an audit-boundary failure of the kind that enterprise procurement checklists explicitly screen for.

Mem0 evaluated eight production harnesses: Claude Code, OpenAI Codex, GitHub Copilot, OpenClaw, Hermes, AWS Bedrock AgentCore, Windsurf, and Devin. The evaluation methodology combined LongMemEval, a public long-memory benchmark, with a 15-session in-house corpus called coding-agent-life-v1. The contamination finding is the headline, but the survey’s other four findings form the structural context that explains why contamination is possible at all.

Every harness studied uses a bounded, mostly local memory pool. When the pool fills, old entries are evicted. No production harness in the survey uses a true unbounded persistent store. That ceiling on memory size creates pressure that designers resolve differently, and those resolution choices are where the cross-user boundary conditions emerge.
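To make the bounded-pool behavior concrete, here is a minimal sketch of a fixed-capacity memory pool. It assumes oldest-first eviction and a single pool shared across users; the survey as summarized here does not say which eviction policy or storage layout any harness actually uses, and `BoundedMemoryPool`, `read_unscoped`, and `read_scoped` are hypothetical names chosen for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryEntry:
    user_id: str
    text: str


class BoundedMemoryPool:
    """Hypothetical fixed-capacity pool; the oldest entry is evicted first."""

    def __init__(self, capacity: int) -> None:
        # A deque with maxlen drops its oldest item when a new one is appended.
        self._entries = deque(maxlen=capacity)

    def write(self, user_id: str, text: str) -> None:
        self._entries.append(MemoryEntry(user_id, text))

    def read_unscoped(self) -> list:
        # No user filter: every caller sees whatever is still in the pool.
        return [e.text for e in self._entries]

    def read_scoped(self, user_id: str) -> list:
        # Filter on the writer's identity before anything reaches the caller.
        return [e.text for e in self._entries if e.user_id == user_id]


pool = BoundedMemoryPool(capacity=3)
pool.write("alice", "alice: prefers tabs")
pool.write("bob", "bob: staging DB is db-7")
pool.write("alice", "alice: uses pytest")
pool.write("bob", "bob: deploys on Fridays")  # pool is full, oldest entry is dropped

unscoped = pool.read_unscoped()
scoped = pool.read_scoped("alice")
```

In this toy version the unscoped read hands Bob's entries to any caller, while the scoped read returns only Alice's surviving entry. It illustrates one way a shared bounded pool could cross a user boundary, not a diagnosis of any specific harness.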

Retrieval quality compounds the problem. Most harnesses rely on keyword search or basic embedding lookup rather than richer semantic or graph-based retrieval, and long-tail recall is weak. When a function signature changes or a user’s preference reverses, the memory layer does not detect the contradiction and continues to serve both the old fact and the new one to whoever queries next.
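The stale-fact behavior can be sketched with two toy stores. This assumes each fact reduces to a (subject, attribute) key, which real memory entries often do not; `AppendOnlyMemory` and `SupersedingMemory` are hypothetical classes, not part of any harness or of Mem0's tooling.

```python
from dataclasses import dataclass, field


@dataclass
class AppendOnlyMemory:
    """Hypothetical store that never reconciles conflicting facts."""
    facts: list = field(default_factory=list)

    def write(self, subject: str, attribute: str, value: str) -> None:
        self.facts.append((subject, attribute, value))

    def recall(self, subject: str, attribute: str) -> list:
        # Returns every value ever written for the key, old and new alike.
        return [v for s, a, v in self.facts if (s, a) == (subject, attribute)]


@dataclass
class SupersedingMemory:
    """Hypothetical store where a new write replaces the old value for a key."""
    latest: dict = field(default_factory=dict)

    def write(self, subject: str, attribute: str, value: str) -> None:
        self.latest[(subject, attribute)] = value

    def recall(self, subject: str, attribute: str) -> list:
        value = self.latest.get((subject, attribute))
        return [] if value is None else [value]


naive, reconciled = AppendOnlyMemory(), SupersedingMemory()
for store in (naive, reconciled):
    store.write("parse_config", "signature", "parse_config(path)")
    store.write("parse_config", "signature", "parse_config(path, strict=False)")

naive_result = naive.recall("parse_config", "signature")
reconciled_result = reconciled.recall("parse_config", "signature")
```

The append-only store returns both signatures, which is the failure described above. Overwriting by key is the simplest possible reconciliation; it says nothing about how contradictions would be detected in free-text memories.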

Harness scoping makes the situation harder to fix than it might appear. A memory entry written inside one harness is not portable to another. The models have been post-trained against their harness’s specific memory layer: Claude against Claude Code’s, GPT-5 against Codex’s. Copying raw memory entries between harnesses does not replicate the behavior because the model’s internal representation of those entries is harness-specific.

The architectural critique embedded in Mem0’s findings connects to a position Sentra articulated separately: memory in agent systems is state, not a service. A service can be swapped, upgraded, or isolated. State carries session history, trust provenance, and context across interactions in ways that the calling system cannot fully inspect. When memory is treated as a service bolted onto a per-harness scratch pad, the result is exactly what Mem0 measured. The contamination is not a bug in any single harness; it is a predictable consequence of an architectural choice that was never audited at enterprise scale.

The contamination rates also explain why enterprise adoption of agent harnesses for production knowledge work has moved more slowly than capability benchmarks would predict. The blocking question is not “can this agent complete the task?” The blocking question is “whose context does the agent think it is in?” Standard security evaluations do not include a test for cross-session memory leakage because no public test exists. Enterprises relying on vendor assurances are effectively trusting that conditions specific enough to trigger the contamination will not arise in their deployments.

Any team evaluating an agent harness for use with client data or privileged code should add session-isolation testing to their procurement checklist before the contract is signed.
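One possible shape for such a test is a canary probe: plant a unique marker as one user, then check whether it surfaces in replies to another. The sketch below is an assumption-laden illustration, not Mem0's methodology. The `Harness` interface, its `send` method, and both fake harnesses are hypothetical stand-ins; a real evaluation would wrap the vendor's actual client.

```python
import uuid
from typing import Protocol


class Harness(Protocol):
    """Hypothetical minimal interface; real harness clients will differ."""
    def send(self, user_id: str, message: str) -> str: ...


def leaks_across_users(harness: Harness, probes: int = 3) -> bool:
    """Plant a unique canary as user-a, then look for it in user-b's replies."""
    canary = f"CANARY-{uuid.uuid4().hex}"
    harness.send("user-a", f"Remember that my deploy token is {canary}.")
    for _ in range(probes):
        reply = harness.send("user-b", "What deploy token should I use?")
        if canary in reply:
            return True
    return False


class LeakyFake:
    """Deliberately broken stand-in: one memory list shared by all users."""
    def __init__(self) -> None:
        self.memory = []

    def send(self, user_id: str, message: str) -> str:
        context = " ".join(self.memory)
        self.memory.append(message)
        return f"context: {context}"


class IsolatedFake:
    """Stand-in that keeps a separate memory list per user."""
    def __init__(self) -> None:
        self.memory = {}

    def send(self, user_id: str, message: str) -> str:
        own = self.memory.setdefault(user_id, [])
        context = " ".join(own)
        own.append(message)
        return f"context: {context}"


leaky_result = leaks_across_users(LeakyFake())
isolated_result = leaks_across_users(IsolatedFake())
```

A probe like this can only demonstrate a leak, never prove isolation. Because the contamination is reported to occur under specific conditions, a clean run means only that this particular probe did not trigger it.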

Mem0 (mem0.ai), 2026-06-02.