Cursor's CursorBench 3.1 Grades Agents on Real, Messy Coding Tasks

The eval swaps clean toy problems for ambiguous, multi-file tasks pulled from actual Cursor sessions, and ranks models by cost too.

Alessandro Benigni

PUBLISHED JUL 3, 2026





Cursor released CursorBench 3.1, an update to its internal evaluation suite that scores coding agents on ambiguous, multi-file problems pulled directly from real Cursor sessions rather than isolated, hand-built exercises. The new version adds problem categories covering codebase comprehension, bug finding, task planning, and code review, alongside revised grading criteria for edit-type tasks. That focus matters because a coding agent’s usefulness rarely hinges on whether it can complete a tidy single-file exercise with one known correct answer.

CursorBench 3.0, the prior version, covered a narrower set of edit, refactor, and bug-fix tasks. Version 3.1 broadens that scope toward the kind of open-ended judgment calls agents actually face inside a live repository: which file to touch first, how to interpret an underspecified request, and whether a change quietly breaks something three modules away. A synthetic benchmark can reward pattern matching against a known answer key. A benchmark built from real sessions rewards the ability to operate under uncertainty, which is closer to what a team is actually paying an agent to do.

The leaderboard pairs each model’s accuracy score with its cost per task, a figure Cursor derives from published token pricing weighted by what the model actually consumes solving the benchmark’s problems. Fable 5 Max leads with a score of 72.9 percent at an average of $18.02 per task. Composer 2.5 reaches 63.2 percent for $0.55, and GLM 5.2 High scores 50.7 percent for $2.46. Cursor cautions that its own results carry variance, and that narrow score gaps between models may not reflect a meaningful difference in real capability.

The benchmark is Cursor’s own, built from Cursor’s own product usage and graded by Cursor’s own criteria, a limitation worth naming even as it explains why the eval is useful: a benchmark drawn from another company’s session logs would not capture what developers actually ask a coding agent to do inside Cursor specifically. The accuracy-versus-cost pairing is the more interesting result here. A model that clears 70 percent at eighteen dollars a task is not automatically the better choice over one that clears 63 percent for under a dollar; the right answer depends on how many tasks a team plans to run through the agent and what an individual failure costs.
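To make that trade-off concrete, here is a rough Python sketch using only the three scores and per-task costs quoted above. It rests on two assumptions that are not Cursor's: that a benchmark score can be read as the probability a task passes, and that a team can put a dollar figure on a failed task (the hypothetical `failure_cost`). The function names are illustrative, not part of any Cursor tooling.

```python
# Scores and average per-task costs as quoted in the article.
LEADERBOARD = {
    "Fable 5 Max": (0.729, 18.02),
    "Composer 2.5": (0.632, 0.55),
    "GLM 5.2 High": (0.507, 2.46),
}


def cost_per_success(accuracy, cost_per_task):
    """Average spend per passed task, assuming every attempt is paid for
    and the benchmark score approximates the pass rate (an assumption)."""
    return cost_per_task / accuracy


def expected_cost(accuracy, cost_per_task, failure_cost):
    """Model spend plus the expected cost of a failure.
    failure_cost is a hypothetical, team-specific dollar figure."""
    return cost_per_task + (1.0 - accuracy) * failure_cost


def break_even_failure_cost(model_a, model_b):
    """Failure cost at which two models have the same expected cost.
    Solves cost_a + (1 - acc_a) * F == cost_b + (1 - acc_b) * F for F."""
    (acc_a, cost_a), (acc_b, cost_b) = model_a, model_b
    return (cost_a - cost_b) / (acc_a - acc_b)


if __name__ == "__main__":
    for name, (acc, cost) in LEADERBOARD.items():
        print(f"{name}: ${cost_per_success(acc, cost):.2f} per passed task")
    threshold = break_even_failure_cost(
        LEADERBOARD["Fable 5 Max"], LEADERBOARD["Composer 2.5"]
    )
    print(f"Break-even failure cost: ${threshold:.2f}")
```

Under those assumptions, the spend per passed task works out to roughly $24.72 for Fable 5 Max, $0.87 for Composer 2.5, and $4.85 for GLM 5.2 High. The more expensive leader only comes out ahead of Composer 2.5 once a single failure costs a team more than about $180. Given Cursor's own caution about variance, these figures are best treated as order-of-magnitude guides.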

Teams evaluating coding agents for production workflows should treat CursorBench 3.1’s cost-versus-accuracy pairing, not leaderboard rank alone, as the number to act on. Before locking in a model for high-volume agentic work, they should re-run their own task mix against both the high-accuracy and the low-cost ends of that curve.

Reported by Cursor on its CursorBench 3.1 evals page.
