Frontier coding agents have started to bunch together on the leaderboards that teams use to make buying decisions. SWE-Bench Pro, until now the leading agentic coding benchmark, spans only 11 repositories and carries verifier error rates that Datacurve.ai audited at 8 percent false positives and 24 percent false negatives. At those error rates, small differences between top models are noise, not signal. Datacurve.ai published DeepSWE on May 26 to close that gap.
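
One way to see why those error rates matter is to treat the verifier as a noisy classifier. The sketch below assumes the 8 percent figure is the chance that a wrong patch is marked as passing and the 24 percent figure is the chance that a correct patch is marked as failing. The source does not spell out those denominators, so this is one interpretation, and the pass rates for the two models are invented for illustration. It also assumes verifier errors are independent of which model produced the patch.

```python
# Assumed interpretation of the audited rates (not confirmed by the source).
FALSE_POSITIVE_RATE = 0.08  # P(verifier passes | patch is actually wrong)
FALSE_NEGATIVE_RATE = 0.24  # P(verifier fails | patch is actually correct)


def observed_pass_rate(true_pass_rate: float,
                       fp: float = FALSE_POSITIVE_RATE,
                       fn: float = FALSE_NEGATIVE_RATE) -> float:
    """Pass rate a leaderboard would report for a given true pass rate."""
    correct_and_passed = true_pass_rate * (1.0 - fn)
    wrong_but_passed = (1.0 - true_pass_rate) * fp
    return correct_and_passed + wrong_but_passed


def gap_retained(fp: float = FALSE_POSITIVE_RATE,
                 fn: float = FALSE_NEGATIVE_RATE) -> float:
    """Fraction of a true gap between two models that survives the verifier."""
    return 1.0 - fp - fn


if __name__ == "__main__":
    # Hypothetical true pass rates for two models, for illustration only.
    model_a, model_b = 0.50, 0.45
    observed_gap = observed_pass_rate(model_a) - observed_pass_rate(model_b)
    print(f"true gap:     {model_a - model_b:.3f}")
    print(f"observed gap: {observed_gap:.3f}")
    print(f"retained:     {gap_retained():.2f}")
```

Under these assumptions a true 5-point gap shrinks to about 3.4 points on the leaderboard, before any sampling noise from the finite task set is added on top.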

DeepSWE covers 113 tasks across 91 repositories in TypeScript, Go, Python, JavaScript, and Rust. The median repository contributes a single task, which prevents any flagship framework from dominating the score. That breadth is a deliberate design choice: the team wanted a benchmark that approximates the full range of projects developers actually point coding agents at, not the well-documented, heavily maintained codebases that dominate public evaluation sets.

The benchmark rests on four claimed advances. First, contamination: every reference solution is written from scratch and never merged back into the upstream repository, so the fix does not enter the public GitHub record. Second, complexity: prompts are roughly half the length of SWE-Bench Pro prompts, yet the Datacurve.ai team reports that solutions require 5.5 times more code and about twice as many output tokens. Third, repository diversity: 91 repositories versus 11 in SWE-Bench Pro Public. Fourth, verifier quality: graders are purpose-written from the task description, test observable behavior rather than implementation details, and were run three times during authoring so that flaky results could be caught before they inflated the variance in model scores.

The verifier comparison is where Datacurve.ai makes its sharpest claim. The team drew 30 tasks at random from DeepSWE and SWE-Bench Pro, ran 3 rollouts each across 10 frontier agent configurations, and then asked an LLM judge to assess whether each patch actually implemented the requested behavior. The judge disagreed with SWE-Bench Pro’s verifier on 32 percent of trials and with DeepSWE’s verifier on 1.4 percent. A benchmark where nearly one in three pass/fail decisions may be wrong provides a weak basis for vendor selection.
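
The disagreement figure is a simple ratio over trials. The following is a minimal sketch, assuming each trial records one boolean verdict from the benchmark's verifier and one from the LLM judge. The `Trial` type, its field names, and the sample data are hypothetical and do not come from Datacurve.ai's release. If 30 tasks were drawn from each benchmark, the design would imply up to 30 × 3 × 10 = 900 trials per benchmark, though the source does not state the final count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One rollout of one agent configuration on one task (hypothetical schema)."""
    task_id: str
    verifier_passed: bool  # verdict from the benchmark's own grader
    judge_passed: bool     # verdict from the LLM judge


def disagreement_rate(trials: list[Trial]) -> float:
    """Share of trials where the verifier and the judge reach different verdicts."""
    if not trials:
        raise ValueError("no trials to score")
    disagreements = sum(t.verifier_passed != t.judge_passed for t in trials)
    return disagreements / len(trials)


def split_disagreements(trials: list[Trial]) -> tuple[int, int]:
    """Count (verifier pass / judge fail, verifier fail / judge pass) cases."""
    pass_then_fail = sum(t.verifier_passed and not t.judge_passed for t in trials)
    fail_then_pass = sum(t.judge_passed and not t.verifier_passed for t in trials)
    return pass_then_fail, fail_then_pass


if __name__ == "__main__":
    # Toy data for illustration only; not real benchmark results.
    sample = [
        Trial("task-1", verifier_passed=True, judge_passed=True),
        Trial("task-1", verifier_passed=True, judge_passed=False),
        Trial("task-2", verifier_passed=False, judge_passed=False),
        Trial("task-2", verifier_passed=False, judge_passed=True),
    ]
    print(f"disagreement rate: {disagreement_rate(sample):.2f}")
    print(f"split (pass/fail, fail/pass): {split_disagreements(sample)}")
```

Splitting the disagreements by direction matters because the two kinds of error have different effects: a verifier that passes patches the judge rejects inflates scores, while one that fails patches the judge accepts deflates them. Note that the judge itself can be wrong, so a disagreement is a flag for review and not proof of a verifier error.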

The skeptical read is worth stating directly. DeepSWE is a single-team effort from Datacurve.ai, a company whose commercial interests align with demonstrating that existing benchmarks are inadequate. The contamination claim is structurally difficult to verify from outside: Datacurve.ai asserts the solutions are novel, but independent auditors cannot easily confirm that no similar fix exists in pre-training corpora. The LLM-as-judge methodology used to critique SWE-Bench Pro also introduces its own failure modes, since the judge’s own biases shape which trajectories it flags. And benchmarks that introduce genuine signal tend to attract training pressure quickly: within six months of wide adoption, labs typically begin tuning on similar long-horizon SWE patterns, which compresses the separation DeepSWE currently shows.

This benchmark fits a pattern that surfaced in the evaluation space this week. BenchBench and similar meta-evaluation efforts have pushed toward the same conclusion: benchmarks saturate, and when they do, the model selection decisions that relied on them become unreliable. DeepSWE is one proposed response, not the final answer.

For teams running coding-agent evaluations now, the practical implication is straightforward. SWE-Bench Pro should stay in the suite because it has historical comparisons and broad adoption. DeepSWE should be added as a second layer, particularly for workflows involving long-horizon tasks across unfamiliar codebases, which is precisely where SWE-Bench Pro’s narrow repository coverage loses resolution. If the separation DeepSWE shows between frontier models holds up under replication, it gives procurement and engineering teams a more defensible basis for picking a coding agent than leaderboard clustering currently permits.

Posted on the DeepSWE blog at deepswe.datacurve.ai on 2026-05-26.