Most public benchmarks for enterprise agents test narrow, synthetic tasks. ServiceNow Research published EVA-Bench Data 2.0 on Hugging Face on June 4, expanding its evaluation set to three domains: Airline Customer Service Management, Enterprise IT Service Management, and Healthcare HR Service Delivery. Together they cover 213 scenarios across 121 tools, roughly four times the scope of the original release.
The scale matters because agent failures in enterprise deployments are domain-specific. A framework that handles flight rebooking without errors can still break on FMLA policy lookups. The 121-tool surface, which includes adversarial calls, multi-intent conversations, and unsatisfiable user goals, offers the kind of breadth that exposes where orchestration logic turns brittle.
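One way to see why domain-specific failures matter is to break results out per domain instead of reporting a single aggregate. The sketch below is illustrative only: the record fields (`domain`, `passed`), the domain labels, and the sample values are hypothetical and do not reflect the dataset's actual schema or any published scores.

```python
from collections import defaultdict


def pass_rate_by_domain(results):
    """Return {domain: fraction of scenarios passed} for a list of result records."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for record in results:
        totals[record["domain"]] += 1
        if record["passed"]:
            passes[record["domain"]] += 1
    return {domain: passes[domain] / totals[domain] for domain in totals}


# Made-up records for illustration; field names and labels are assumptions.
sample = [
    {"domain": "airline_csm", "passed": True},
    {"domain": "airline_csm", "passed": True},
    {"domain": "healthcare_hrsd", "passed": True},
    {"domain": "healthcare_hrsd", "passed": False},
]

rates = pass_rate_by_domain(sample)
overall = sum(record["passed"] for record in sample) / len(sample)

print(rates)    # per-domain pass rates
print(overall)  # single aggregate number
```

In this toy data the aggregate looks healthy while one domain passes only half its scenarios, which is the pattern a single headline score can hide.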
The commercial alignment is obvious: ServiceNow benefits if the canonical enterprise agent benchmark maps to its own workflow topology. Read the scores with that in mind. Even so, the underlying dataset is released under the MIT license and structured for drop-in use with standard evaluation harnesses, which gives it practical utility beyond ServiceNow’s leaderboard.
Teams evaluating enterprise voice or tool-calling agents should run their stack against this before committing to an architecture for customer-facing deployments.
ServiceNow Research on Hugging Face (huggingface.co/ServiceNow-AI), published June 4, 2026.