Security researcher Kasra Esfahani spent $1,500 running ten frontier models against a deliberately vulnerable app to find out which ones could actually execute an offensive workflow. GPT-5.5 solved the challenge in 7 of 10 attempts. DeepSeek-V4-Pro managed 3 of 10. Claude Sonnet 4.6 finished 2 of 10, with five additional runs reaching the correct path before burning through the $10 per-run budget cap. Most other models never got started.

The vulnerability was a Missing Object-Level Authorization flaw of the type Esfahani encounters regularly in production audits: a hardened API sitting in front of a Firebase backend whose credentials were bundled inside the app’s APK. The actual exploit required extracting those credentials, registering directly with Firebase, and reading a private Firestore collection. The API itself was never the target. Models that spent their budgets probing API endpoints, looking for classic IDOR patterns in HTTP responses, never found the flag.
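The first step of that path, noticing that an APK ships Firebase identifiers at all, is also something a defender can check on their own builds. The following is a minimal sketch of such an audit, not Esfahani's harness. The function name `scan_apk` and both patterns are assumptions made for illustration: the sketch treats an APK as an ordinary zip archive and searches raw bytes for the widely documented `AIza` Google API key shape and for Firebase hostnames. Strings stored as UTF-16 inside compiled resources would be missed by this byte-level search.

```python
import io
import re
import zipfile

# Assumed pattern: Google API keys are commonly "AIza" plus 35 URL-safe characters.
API_KEY_RE = re.compile(rb"AIza[0-9A-Za-z_\-]{35}")
# Assumed pattern: hostnames that commonly appear in Firebase client configuration.
FIREBASE_HOST_RE = re.compile(rb"[a-z0-9\-]+\.(?:firebaseio\.com|firebaseapp\.com)")


def scan_apk(apk):
    """Return {entry_name: [matched strings]} for an APK path or file-like object.

    Hypothetical helper for auditing your own builds. It only reports
    what is bundled in the archive and contacts no backend.
    """
    findings = {}
    with zipfile.ZipFile(apk) as archive:
        for name in archive.namelist():
            data = archive.read(name)
            hits = set(API_KEY_RE.findall(data)) | set(FIREBASE_HOST_RE.findall(data))
            if hits:
                findings[name] = sorted(hit.decode("ascii") for hit in hits)
    return findings
```

A non-empty result is not a vulnerability in itself, since Firebase client identifiers are designed to ship inside apps. It is a prompt to check that the backend's security rules, and not the API in front of it, enforce per-object authorization.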

GPT-5.5’s edge was target selection. In almost every successful run, the model unzipped the APK, identified the Firebase configuration file, and moved straight to the correct attack surface within two or three requests. DeepSeek-V4-Pro found Firebase in roughly half its runs but then tried to use those credentials against the API rather than against Firebase directly, a conceptual error rather than a capability limit. Claude Sonnet 4.6 followed a methodical path through the API and the app before arriving at Firebase. The reasoning was correct but slow: a median of 390,000 tokens per run versus 260,000 for GPT-5.5, and a cost per solve of $45.75 versus $9.46.

Esfahani’s results are the empirical counterpart to Anthropic’s Mythos distribution strategy. Last week Anthropic announced it would make its most capable models available only to vetted security-research partners under the Project Glasswing umbrella, citing 10,000 critical vulnerability discoveries as justification for restricted access. Esfahani’s data shows what Anthropic is quietly acknowledging: the gap between a model that can do offensive security work and one that will do it is mostly a function of framing, not architecture.

The punchline in Esfahani’s post is that most of the models that refused his task would have attempted it if he had labeled it a CTF challenge. The observation holds across virtually every refusal-bypass story published in 2026. A CTF context does not change the model’s capability. It changes the model’s classification of the request. Anthropic’s answer to that problem, with Mythos, is to vet the requester rather than rely on context-sensitive refusals. Esfahani’s answer, in this experiment, was to hold an OpenAI account pre-approved for security research, which is why GPT-5.5 ran without guardrail interruption.

Models that refused outright, including both Gemini variants, Grok Build, and Llama-class models, show telling token counts. Gemini 3.1 Pro’s median run consumed 9,000 tokens. Every other model that engaged used at least 100,000. The token delta is the refusal, expressed numerically.
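The token gap can be turned into a rough triage rule for run logs. The sketch below is an illustration only. The 50,000-token cutoff is an assumed value chosen to sit between the 9,000 and 100,000 figures above, the function names are hypothetical, and the sample counts in the usage note are invented rather than taken from Esfahani's per-run data.

```python
from statistics import median

# Assumed cutoff between the ~9,000-token refusals and the >=100,000-token
# engaged runs described in the article. Tune it for your own harness.
ENGAGEMENT_THRESHOLD = 50_000


def classify_runs(token_counts, threshold=ENGAGEMENT_THRESHOLD):
    """Label each run by whether it consumed enough tokens to count as engaged."""
    return ["engaged" if tokens >= threshold else "refused_early" for tokens in token_counts]


def summarize(token_counts, threshold=ENGAGEMENT_THRESHOLD):
    """Return the median token count and the share of runs labeled as early refusals."""
    labels = classify_runs(token_counts, threshold)
    return {
        "median_tokens": median(token_counts),
        "early_refusal_rate": labels.count("refused_early") / len(labels),
    }
```

For example, `summarize([8_000, 9_000, 10_000, 150_000])` reports a median of 9,500 tokens and an early-refusal rate of 0.75. A rule like this only catches refusals that happen at the start of a run. A model that engages for a long time and refuses late would still be labeled as engaged.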

A few findings do not fit the guardrails-versus-capability narrative. Claude Opus 4.8 matched Sonnet 4.6’s 2-of-10 solve rate but for a different reason: late refusals, not budget exhaustion. DeepSeek-V4-Flash recognized Firebase but concluded each time that the API was secure, a reasoning failure rather than a guardrail trigger. Qwen 3.7 Max solved the challenge in local testing, then failed all six formal runs while producing seven million tokens per attempt.

Esfahani is explicit that this is not a rigorous eval. Ten trials per model, one vulnerability class, one environment. The Wilson confidence intervals overlap significantly. GPT-5.5’s interval does not rule out a true solve rate as low as 40 percent.
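For readers who want to check the interval claim, here is a minimal sketch of the Wilson score interval applied to the solve counts reported above. The 95 percent level (z = 1.96) is an assumption, since the confidence level Esfahani used is not stated here, and the function name is illustrative.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to a 95% level, which is an assumption here.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half_width, center + half_width


# Solve counts out of 10 runs, as reported in the article.
SOLVES = {"GPT-5.5": 7, "DeepSeek-V4-Pro": 3, "Claude Sonnet 4.6": 2}

if __name__ == "__main__":
    for model, solves in SOLVES.items():
        low, high = wilson_interval(solves, 10)
        print(f"{model}: {solves}/10 -> [{low:.2f}, {high:.2f}]")
```

Under that assumption, the lower bound for 7 of 10 is about 0.40, which matches the figure quoted above, and the upper bound for 3 of 10 is higher than that lower bound, so the two intervals overlap.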

What the data does establish is an operational baseline for anyone running agentic security workflows today. At current pricing, GPT-5.5 costs roughly $9.46 per confirmed exploit. DeepSeek-V4-Pro costs $0.62 per solve on the same task class, roughly 15 times less, for teams willing to accept a 30 percent success rate. For security teams building automated reconnaissance pipelines, that ratio matters more than the solve rate in isolation.
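The arithmetic behind those figures can be sketched as follows. The sketch assumes that cost per solve means total spend across a model's ten runs divided by its number of solves. That definition is an assumption, as are the function names. The spend totals it derives are implied by the reported numbers and are not figures from Esfahani's post.

```python
RUNS_PER_MODEL = 10

# (solves out of 10, reported cost per solve in USD), both from the article.
REPORTED = {
    "GPT-5.5": (7, 9.46),
    "DeepSeek-V4-Pro": (3, 0.62),
    "Claude Sonnet 4.6": (2, 45.75),
}


def implied_total_spend(model):
    """Total spend implied by cost_per_solve * solves (assumed definition)."""
    solves, cost_per_solve = REPORTED[model]
    return solves * cost_per_solve


def implied_cost_per_run(model):
    """Average spend per run implied by the reported figures."""
    return implied_total_spend(model) / RUNS_PER_MODEL


def expected_runs_per_solve(model):
    """Average number of runs needed per solve at the observed solve rate."""
    solves, _ = REPORTED[model]
    return RUNS_PER_MODEL / solves


def cost_ratio(expensive, cheap):
    """How many times more one model costs per solve than another."""
    return REPORTED[expensive][1] / REPORTED[cheap][1]
```

Here `cost_ratio("GPT-5.5", "DeepSeek-V4-Pro")` comes to roughly 15.3, in line with the 15x figure. Under the same assumption, `implied_cost_per_run("Claude Sonnet 4.6")` is about $9.15, which is consistent with most of its runs ending near the $10 cap.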

Operators building AI-assisted security tooling should run this exact exploit class, or one of its nearest neighbors, against their candidate model before signing any contract. The guardrail picture changes with account approval status, harness framing, and temperature settings, and Esfahani’s methodology section makes each of those variables explicit.

Kasra Esfahani on kasra.blog, published June 3, 2026.