Cursor released CursorBench 3.1, an update to its internal evaluation suite that scores coding agents on ambiguous, multi-file problems pulled directly from real Cursor sessions rather than isolated, hand-built exercises. The new version adds problem categories covering codebase comprehension, bug finding, task planning, and code review, alongside revised grading criteria for edit-type tasks. That focus matters because a coding agent’s usefulness rarely hinges on whether it can complete a tidy single-file exercise with one known correct answer.

CursorBench 3.0, the prior version, covered a narrower set of edit, refactor, and bug-fix tasks. Version 3.1 broadens that scope toward the kind of open-ended judgment calls agents actually face inside a live repository: which file to touch first, how to interpret an underspecified request, and whether a change quietly breaks something three modules away. A synthetic benchmark can reward pattern matching against a known answer key. A benchmark built from real sessions rewards the ability to operate under uncertainty, which is closer to what a team is actually paying an agent to do.

The leaderboard pairs each model’s accuracy score with its cost per task, a figure Cursor derives from published token pricing weighted by the tokens a model actually consumes solving the benchmark’s problems. Fable 5 Max leads with a score of 72.9 percent at an average of $18.02 per task. Composer 2.5 reaches 63.2 percent for $0.55, and GLM 5.2 High scores 50.7 percent for $2.46. Cursor cautions that its own results carry variance, and that narrow score gaps between models may not reflect a meaningful difference in real capability.
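As a rough illustration of how a per-task cost could be derived from token pricing and measured usage, here is a minimal Python sketch. Cursor's actual formula is not described beyond the sentence above, so the function names, the split into input and output tokens, and every price and token count below are hypothetical placeholders, not published figures.

```python
from statistics import mean


def cost_per_task(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Dollar cost of one task given token counts and per-million-token prices."""
    total = input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    return total / 1_000_000


def average_cost(usage, usd_per_m_input, usd_per_m_output):
    """Mean cost across tasks; usage is a list of (input_tokens, output_tokens)."""
    return mean(
        cost_per_task(i, o, usd_per_m_input, usd_per_m_output) for i, o in usage
    )


# Hypothetical usage for two benchmark tasks and hypothetical prices
# (USD per million tokens). None of these numbers come from Cursor.
example_usage = [(400_000, 20_000), (1_200_000, 60_000)]
example_average = average_cost(example_usage, usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"average cost per task: ${example_average:.2f}")
```

The point of weighting by actual consumption is that a model with a low list price can still be expensive per task if it burns many tokens reaching an answer, and the reverse.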

The benchmark is Cursor’s own, built from Cursor’s product usage and graded by Cursor’s criteria. That is a limitation worth naming, though it also explains why the eval is useful: a benchmark drawn from another company’s session logs would not capture what developers actually ask a coding agent to do inside Cursor specifically. The accuracy-versus-cost pairing is the more interesting result here. A model that clears 70 percent at about $18 a task is not automatically a better choice than one that clears 63 percent for under a dollar; the right answer depends on how many tasks a team plans to run through the agent and what an individual failure costs.
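The trade-off can be made concrete with a short Python sketch using the leaderboard figures quoted above. It rests on two simplifying assumptions that are not Cursor's: that a benchmark score can stand in for the success rate on a team's own tasks, and that every failed task costs the team a fixed dollar amount. The function names are illustrative.

```python
# Leaderboard figures quoted in the article: (accuracy, average USD per task).
LEADERBOARD = {
    "Fable 5 Max": (0.729, 18.02),
    "Composer 2.5": (0.632, 0.55),
    "GLM 5.2 High": (0.507, 2.46),
}


def cost_per_solved_task(accuracy, cost_per_task):
    """Expected spend to get one solved task, assuming independent attempts."""
    return cost_per_task / accuracy


def break_even_failure_cost(model_a, model_b):
    """Failure cost F at which two models have equal expected cost per task.

    Expected cost per task is modeled as cost + (1 - accuracy) * F.
    Each argument is an (accuracy, cost_per_task) pair.
    """
    (acc_a, cost_a), (acc_b, cost_b) = model_a, model_b
    if acc_a == acc_b:
        raise ValueError("equal accuracy: the cheaper model always wins")
    return (cost_a - cost_b) / (acc_a - acc_b)


for name, (accuracy, cost) in LEADERBOARD.items():
    print(f"{name}: ${cost_per_solved_task(accuracy, cost):.2f} per solved task")

threshold = break_even_failure_cost(
    LEADERBOARD["Fable 5 Max"], LEADERBOARD["Composer 2.5"]
)
print(f"break-even failure cost: ${threshold:.2f}")
```

Under these assumptions, Fable 5 Max works out to roughly $24.72 per solved task against about $0.87 for Composer 2.5, and the more expensive model comes out ahead in expectation only when a single failure costs the team more than about $180. Given Cursor's own caution about variance, a threshold like this is a rough guide rather than a precise cutoff.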

Teams evaluating coding agents for production workflows should treat CursorBench 3.1’s cost-versus-accuracy pairing, not the leaderboard rank alone, as the number to act on, and should re-run their own task mix against both ends of that curve before locking in a model for high-volume agentic work.

Reported by Cursor on its CursorBench 3.1 evals page.