A new benchmark published on arXiv reveals a steep gap between what coding agents score on standard tests and what they can actually do when asked to maintain real software. RoadmapBench, introduced in a paper by Xinbo Xu and colleagues, comprises 115 tasks built from genuine version upgrades in open-source repositories. The best-performing model, Claude Opus 4.7, resolved only 39.1% of them. The weakest managed 5.2%.

The gap matters because most popular coding benchmarks measure something narrower: single-file bug fixes, usually in Python, scored by a binary pass or fail. That format suits short isolated tasks. It does not suit the work that engineering teams actually do when a library ships a major version, an API surface changes across dozens of call sites, or a codebase needs coordinated modifications in multiple languages at once.

RoadmapBench is built differently. Each of its 115 tasks drops an agent onto a source-version snapshot of a real repository and gives it a multi-target instruction describing the changes introduced in the target version. The agent must figure out which files to touch, what to rewrite, and in what order. The median task requires modifying 3,700 lines across 51 files. The tasks are drawn from 17 distinct open-source projects spanning five programming languages, so the evaluation is not tied to a single language or ecosystem.

That 51-file median is, on its own, a stress test for context management. A model that can write a clean function on request must also decide, across an unfamiliar codebase, which 51 files are relevant and in what sequence to modify them without breaking intermediate states. Current agents do not do that reliably. Thirteen frontier models were evaluated, and the spread between the strongest and the weakest is about 34 percentage points (39.1% versus 5.2%). The benchmark therefore discriminates between capability levels, but no model clears a threshold that would make it dependable for unsupervised upgrade work.
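As a quick sanity check on the figures above, the reported percentages can be turned back into approximate task counts. The short Python sketch below assumes each score is simply the share of the 115 tasks that a model resolved. The function and variable names are illustrative and do not come from the paper.

```python
# Figures reported in the article.
TOTAL_TASKS = 115
BEST_PCT = 39.1    # strongest model's resolve rate
WORST_PCT = 5.2    # weakest model's resolve rate


def implied_task_count(pct: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks consistent with a reported percentage.

    Assumption: the score is resolved_tasks / total, rounded to one decimal.
    """
    return round(pct / 100 * total)


best_tasks = implied_task_count(BEST_PCT)
worst_tasks = implied_task_count(WORST_PCT)
spread_points = round(BEST_PCT - WORST_PCT, 1)

print(f"Best model: about {best_tasks} of {TOTAL_TASKS} tasks")
print(f"Weakest model: about {worst_tasks} of {TOTAL_TASKS} tasks")
print(f"Spread: {spread_points} percentage points")
```

Under that assumption, the best score corresponds to roughly 45 resolved tasks and the weakest to roughly 6, with a spread of 33.9 points.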

The benchmark authors frame long-horizon software development as a largely unsolved problem, and the numbers back that framing. Agents marketed for software engineering are often benchmarked on SWE-bench, which is built from individual GitHub issues that are typically resolved by small, often single-file patches. A ceiling of roughly 39% on a multi-file, multi-language dataset of real upgrades suggests those scores overstate readiness for coordinated maintenance work.

For teams evaluating coding agents for production automation, RoadmapBench provides a more useful signal than current standard suites when the target workflow involves version upgrades, cross-file refactors, or multi-language repositories. Before committing to a deployment architecture, teams considering autonomous agents for upgrade pipelines should run internal evaluations against realistic multi-file tasks rather than rely on single-issue benchmark scores.
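One way a team might tally such an internal evaluation is to report resolve rates separately for single-file and multi-file tasks, so that a strong single-file score cannot hide weak multi-file performance. The Python sketch below is a minimal illustration and is not RoadmapBench's scoring method. The `TaskResult` record, the `summarize` helper, the two-file threshold, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class TaskResult:
    """Hypothetical record for one internal evaluation task."""
    task_id: str
    files_changed: int   # files the reference change touches
    resolved: bool       # did the agent's change pass the team's checks?


def resolve_rate(group: list) -> Optional[float]:
    """Fraction of tasks resolved, or None for an empty group."""
    if not group:
        return None
    return sum(r.resolved for r in group) / len(group)


def summarize(results: list, multi_file_threshold: int = 2) -> dict:
    """Split results by task size and report resolve rates for each slice."""
    single = [r for r in results if r.files_changed < multi_file_threshold]
    multi = [r for r in results if r.files_changed >= multi_file_threshold]
    return {
        "overall": resolve_rate(results),
        "single_file": resolve_rate(single),
        "multi_file": resolve_rate(multi),
        "median_files": median([r.files_changed for r in results]) if results else None,
    }


# Made-up sample data, for illustration only.
results = [
    TaskResult("fix-typo", 1, True),
    TaskResult("patch-null-check", 1, True),
    TaskResult("rename-api", 12, False),
    TaskResult("upgrade-v2", 51, False),
    TaskResult("migrate-config", 8, True),
]

summary = summarize(results)
print(summary)
```

In this invented sample the overall rate is 60%, but it splits into 100% on single-file tasks and about 33% on multi-file tasks. That is the kind of divergence the article argues a single headline score can conceal.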

Source: arXiv, paper “RoadmapBench: Evaluating Long-Horizon Agentic Software Development Across Version Upgrades” (arXiv .15846), submitted May 15, 2026, revised May 19, 2026.