Cognition, the company behind Devin, published FrontierCode on June 8, a benchmark built around one question: would a senior maintainer merge this pull request? The answer, across every frontier model tested, is mostly no.
The distinction the benchmark draws matters. SWE-bench and its variants score whether generated code passes tests or matches a reference solution. FrontierCode scores whether the PR-shaped output would survive a real code review, evaluating correctness alongside scope discipline, style consistency, idiomatic patterns, test quality, and regression safety. Those are the dimensions that determine whether AI-generated code ships or gets kicked back.
The results are stark. On FrontierCode Diamond, the hardest 50-task subset, Claude Opus 4.8 leads all models with a score of 13.4%. GPT-5.5 scores 6.3% and Gemini 3.1 Pro reaches 4.7%. On the full 150-task Extended set, Opus 4.8 climbs to 51.8%, but that still means roughly half of its output would fail a maintainer’s bar. GPT-5.5 scores lower on raw performance but uses as little as a quarter of the tokens Opus 4.8 does, which Cognition describes as a better cost-efficiency tradeoff. Open-source models lag considerably: Kimi K2.6, the top open-source result, scores 3.8% on Diamond.
The benchmark was built with 20-plus maintainers from 36 flagship open-source repositories, each spending more than 40 hours per task. They authored the evaluation criteria, meaning the definition of “mergeable” came from people who actually approve commits rather than from benchmark designers approximating what mergeability means.
Cognition reports that FrontierCode produces 81% fewer misclassification errors than SWE-Bench Pro, a figure it says is validated against METR’s prior finding that models scoring highly on existing benchmarks frequently generate patches that real maintainers reject.
The grading methodology introduces three techniques worth noting. A “reverse-classical” test runs agent-written tests against the original broken codebase and requires those tests to fail, confirming the agent actually understood the problem rather than writing vacuous tests. A scope criterion automatically checks that patches modify only what they need to. Adaptive classical grading uses an LLM to adjust reference tests to match valid alternative implementations, avoiding false failures caused by superficial differences in function names or error strings.
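The first two checks are simple enough to sketch in a few lines of Python. This is an illustration of the idea, not Cognition’s grader: every name below is hypothetical, a “codebase” is reduced to a single function, the reverse-classical rule is assumed to mean that at least one agent-written test must fail on the broken code while all pass on the patched code, and the scope check is assumed to be a path allowlist. Adaptive classical grading depends on an LLM and is left out.

```python
from typing import Callable, Iterable, List

# Hypothetical stand-ins: a "codebase" is just the function under test, and an
# agent-written test is a callable that returns True when it passes.
Codebase = Callable[[int, int], int]
AgentTest = Callable[[Codebase], bool]


def reverse_classical_ok(
    tests: List[AgentTest], broken: Codebase, patched: Codebase
) -> bool:
    """Assumed rule: the suite must fail before the fix and pass after it."""
    if not tests:
        return False  # no tests means nothing demonstrates the bug
    fails_before = any(not test(broken) for test in tests)
    passes_after = all(test(patched) for test in tests)
    return fails_before and passes_after


def out_of_scope_files(
    changed_files: Iterable[str], allowed_prefixes: Iterable[str]
) -> List[str]:
    """Assumed scope rule: every changed path must sit under an allowed prefix."""
    prefixes = tuple(allowed_prefixes)
    return sorted(path for path in changed_files if not path.startswith(prefixes))


# Toy task: add() subtracts instead of adding.
def broken_add(a: int, b: int) -> int:
    return a - b


def patched_add(a: int, b: int) -> int:
    return a + b


def real_test(code: Codebase) -> bool:
    return code(2, 3) == 5  # fails on the broken code, passes once fixed


def vacuous_test(code: Codebase) -> bool:
    return code(0, 0) == 0  # passes on both versions, so it proves nothing


print(reverse_classical_ok([real_test], broken_add, patched_add))     # True
print(reverse_classical_ok([vacuous_test], broken_add, patched_add))  # False
print(out_of_scope_files(["src/log.cc", "README.md"], ["src/"]))      # ['README.md']
```

The vacuous test passes against the fixed code, which is all a conventional harness would check. It is rejected here only because it also passes against the broken code.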
An example from the benchmark illustrates the gap well. Claude Opus 4.8, given a C++ task involving a LOG_WARNING() helper function, consistently mixed LOG_WARNING() and std::cerr calls on multi-line messages. Both produce identical runtime behavior. Only one follows the codebase convention. FrontierCode flags it; SWE-bench-style evaluation would not.
The strategic framing here is deliberate. Cognition sells Devin as an agent that produces merge-ready output. FrontierCode is the instrument that makes that claim testable, and it measures exactly the dimension where Devin’s competitors are weakest. To prevent contamination, the benchmark’s tasks will not be released publicly, but Cognition says it is opening evaluation access to all model developers.
The release announcement does not include Devin’s own FrontierCode scores, which is notable given that the benchmark is Cognition’s product and the post frames Devin as the answer to the problem FrontierCode quantifies.
For teams currently evaluating AI coding tools on SWE-bench metrics, FrontierCode scores now represent the more relevant production signal. A model or agent that cannot clear 50% on FrontierCode Extended is generating output that engineering teams will spend time reviewing and revising, and that friction does not appear in correctness-only numbers.
Cognition (cognition.ai/blog), 2026-06-08.