Turning up reasoning effort on a frontier model does not reliably improve its ability to find security vulnerabilities. That is the central finding from a recent experiment by Parsia on parsiya.net, who ran 26 model-effort combinations across Claude 4.6, 4.7, 4.8, and GPT-5.4 and 5.5 on two real-world C vulnerability test cases derived from the Mythos benchmark study.

The setup was straightforward. Each model-effort pair received either an entire source file or just the vulnerable function, with no tool access, and was asked to produce a security triage. An LLM council of four models (gpt-5.4-high, gpt-5.5-high, claude-4.6-high, claude-4.7-high) then judged the results. The council agreed unanimously 86.2 percent of the time, which the author notes is close to prior iterations and suggests the judging method is at least internally consistent.
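To make the agreement figure concrete, here is a minimal sketch of how a unanimous-agreement rate could be computed from per-run judge verdicts. The verdict labels and the sample data are invented for illustration; the original experiment's scoring scheme and data format are not described here and may differ.

```python
from typing import Sequence


def unanimous_rate(verdicts_per_run: Sequence[Sequence[str]]) -> float:
    """Fraction of runs on which every judge returned the same verdict."""
    if not verdicts_per_run:
        raise ValueError("no runs to score")
    unanimous = sum(1 for verdicts in verdicts_per_run if len(set(verdicts)) == 1)
    return unanimous / len(verdicts_per_run)


# Hypothetical verdicts from four judges on three runs (labels are assumptions).
example = [
    ["found", "found", "found", "found"],
    ["partial", "partial", "found", "partial"],
    ["missed", "missed", "missed", "missed"],
]

print(f"{unanimous_rate(example):.1%}")  # 2 of 3 runs are unanimous
```

A high unanimous rate says the judges agree with each other, not that they are right, which is why the figure is best read as a consistency check.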

The performance numbers are illuminating. On the overall normalized score, gpt-5.4 at extra-high reasoning effort led the field at 0.417. But gpt-5.5 at medium outperformed the same model at both high and extra-high, scoring 0.360 against 0.327 for each. For Claude models, high effort consistently beat extra-high across the 4.7 and 4.7-1m families. The pattern is not clean: more reasoning generally helps, but the trend reverses at the top end often enough that blanket “use maximum reasoning” advice falls apart.

One number stands out as a reality check for anyone building automated security pipelines. Across all 2,080 runs, only 1.9 percent produced a complete vulnerability chain. The partial-finding rate was far higher, at 70.8 percent overall, but it fell to 1.7 percent on the harder openbsd-sack test case when models received the full file rather than the isolated function. Given just the vulnerable function, models performed dramatically better: function-level found-rates ran roughly 45 to 50 percentage points higher than whole-file rates on the harder case. That gap is the most actionable finding in the data.
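As a rough sanity check, the percentages can be turned back into run counts. The sketch below uses only the figures quoted in this article; it assumes the 2,080 runs were split evenly across the 26 model-effort combinations, which is not stated, and because the published rates are rounded the counts are approximate.

```python
TOTAL_RUNS = 2080       # from the article
COMBINATIONS = 26       # from the article
COMPLETE_RATE_PCT = 1.9  # rounded figure from the article

# Assumes an even split across combinations (an assumption, not a reported fact).
runs_per_combination = TOTAL_RUNS / COMBINATIONS

# Every whole-number run count whose rate rounds to the published 1.9 percent.
candidates = [
    n for n in range(TOTAL_RUNS + 1)
    if round(100 * n / TOTAL_RUNS, 1) == COMPLETE_RATE_PCT
]

print(f"runs per combination (if evenly split): {runs_per_combination:.0f}")
print(f"complete-chain runs consistent with {COMPLETE_RATE_PCT}%: {candidates}")
```

Under these assumptions, a complete vulnerability chain turned up in about 40 of the 2,080 runs.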

The author is direct about what this means: passing individual functions to a model, rather than entire source files, is a more reliable first-pass strategy. The rest of a large file is mostly noise, and the models lose the signal in it. This aligns with how human reviewers operate, though it requires a preprocessing step to slice the code correctly.
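The slicing step can be illustrated with a deliberately naive sketch. This is not the author's tooling: `extract_c_function` is a hypothetical helper that locates a C function definition by name and matches braces, skipping braces inside comments and string or character literals. It does not understand macros or preprocessor conditionals, so a real pipeline would more likely lean on a proper C parser.

```python
import re
from typing import Optional


def extract_c_function(source: str, name: str) -> Optional[str]:
    """Return the text of the first top-level definition of `name`, else None."""
    pattern = re.compile(r"\b" + re.escape(name) + r"\s*\([^;{}]*\)\s*\{")
    for match in pattern.finditer(source):
        prefix = source[:match.start()]
        # Naive top-level check: braces before the match must be balanced,
        # which filters out calls such as `if (name(x)) {` inside other bodies.
        if prefix.count("{") == prefix.count("}"):
            break
    else:
        return None

    # Start after the previous statement or block so the return type is kept,
    # then drop blank and preprocessor lines that precede the signature.
    start = max(prefix.rfind(";"), prefix.rfind("}")) + 1
    lines = source[start:match.start()].splitlines(keepends=True)
    while lines and (not lines[0].strip() or lines[0].lstrip().startswith("#")):
        start += len(lines.pop(0))

    depth = 0
    i = match.end() - 1  # index of the opening brace
    n = len(source)
    while i < n:
        ch = source[i]
        two = source[i:i + 2]
        if two == "//":  # line comment: jump to the end of the line
            newline = source.find("\n", i)
            i = n if newline == -1 else newline
        elif two == "/*":  # block comment: jump past the closing marker
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
            continue
        elif ch in "\"'":  # string or char literal: skip to the closing quote
            i += 1
            while i < n and source[i] != ch:
                i += 2 if source[i] == "\\" else 1
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return source[start:i + 1]
        i += 1
    return None  # unbalanced braces


# Invented sample input, not code from the benchmark.
SAMPLE = '''\
#include <stdio.h>

static int helper(int x) { return x + 1; }

int
parse(const char *s)
{
    if (helper(1)) { puts("}"); }  /* stray } in a comment */
    return 0;
}
'''

print(extract_c_function(SAMPLE, "parse"))
```

The extracted text, rather than the whole file, would then go into the triage prompt. How much surrounding context (callers, struct definitions, macros) to include alongside the function is a design choice the sketch leaves open.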

Content filtering added an unexpected wrinkle. In earlier iterations, Claude models at extra-high reasoning had content filtering rates as high as 15 to 21 percent on security analysis tasks. The final 2,080-request run saw only two filtered responses, which may indicate that Anthropic has tuned this, but the variance across iterations is wide enough that teams running automated security workflows on Claude at high reasoning should expect occasional dead runs.

The total bill for the final iteration was roughly $2,340, with the full experiment (including failed runs) reaching approximately $9,200. GPT models were cheaper through GitHub Copilot in this setup, and they also scored higher on this specific task. Claude models had one behavioral edge: they were the only ones to cite CVEs in their analysis, which may matter for report formatting in some workflows.
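For budgeting purposes, the quoted totals imply a rough average cost per request. The sketch below uses only the approximate dollar figures above; the average hides per-model and per-effort price differences, which are not broken down here.

```python
FINAL_COST_USD = 2340    # approximate, from the article
FULL_COST_USD = 9200     # approximate, includes failed runs
FINAL_REQUESTS = 2080    # requests in the final iteration

avg_cost_per_request = FINAL_COST_USD / FINAL_REQUESTS
earlier_iterations_cost = FULL_COST_USD - FINAL_COST_USD
final_share = FINAL_COST_USD / FULL_COST_USD

print(f"average cost per request: ${avg_cost_per_request:.3f}")
print(f"spent outside the final iteration: ${earlier_iterations_cost}")
print(f"final iteration share of total spend: {final_share:.0%}")
```

On these figures, roughly three quarters of the spend went to iterations other than the final one, a reminder that the experimentation overhead can dwarf the cost of the run that gets reported.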

The broader implication cuts against a common assumption in security tooling. Teams reaching for the newest model or the highest reasoning setting as a default optimization are not necessarily getting better triage. A well-scoped function-level pass with a mid-tier reasoning setting may outperform an expensive whole-file run at maximum effort. For any team using LLMs as a first-pass layer in a vulnerability review pipeline, this experiment is a prompt to benchmark their actual setup rather than trust the marketing hierarchy of model generations.

Based on research published by Parsia on parsiya.net on June 18, 2026.