The Reliability Gap: Agent Benchmarks for Enterprise

Is agentic AI ready for enterprise use? A review of the key benchmarks and the state of the art.
Author: Paul Simmering
Published: 2026-01-04
Categories: Agents

Unrealized potential due to a lack of reliability

A survey by Pan et al. (2025) of 306 AI agent practitioners found that reliability issues are the biggest barrier to the adoption of AI agents in the enterprise. To achieve the reliability required, practitioners forgo open-ended and long-running tasks in favor of workflows with fewer steps. They limit potential damage by building internal-facing agents whose work is reviewed by employees, rather than customer-facing or machine-to-machine interfaces. These constrained agents are economically useful, but they don't realize the full potential of agentic AI.

Image created with GPT Image 1.5

To quantify how large this reliability gap is, I'll review public benchmarks for agentic AI. In October 2025, Schmid (2025) listed over 50 such benchmarks. That's too many to pay attention to, so in this article I prioritize them from the perspective of an enterprise looking to automate common business tasks.

Benchmark selection criteria

  1. Relevance. Tests abilities relevant to business use cases, ideally the exact task the enterprise wants to automate.
  2. Agentic. Measures agentic abilities with multiple turns and tool use, not just single-turn reasoning.
  3. Best in class. The benchmark is not overshadowed by a more comprehensive benchmark of the same or a closely related ability, or by a newer version of itself.
  4. Leaderboard. The benchmark has a public leaderboard with up-to-date models. This disqualifies the majority of benchmarks: most are published as a paper with a few model scores that quickly go out of date.
Note: Interpreting agentic benchmarks

Benchmark results are sensitive to the language model, the agentic loop code, tools available to the agent including their documentation, the benchmark harness (the environment in which the agent is evaluated), the evaluation method and random variations. Each score reflects a snapshot of all of these variables.
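
One way to keep this in mind when comparing results is to log every one of these variables next to the score. The sketch below shows one possible run record; all field names and values are hypothetical, not taken from any real harness:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkRun:
    """Snapshot of the variables that jointly determine a benchmark score."""
    model: str            # language model name and version
    agent_loop: str       # identifier of the agentic loop / scaffold code
    tools: list[str]      # tools available to the agent
    harness: str          # benchmark harness version (the evaluation environment)
    eval_method: str      # how task success was judged
    seed: int             # captures one source of random variation
    score: float          # e.g. pass^1 for this configuration

# Hypothetical example: changing any field, not just the model, can change the
# score, so comparing leaderboard entries only makes sense if these match.
run = BenchmarkRun(
    model="example-model-v1",
    agent_loop="react-loop-0.3",
    tools=["web_search", "code_interpreter"],
    harness="gaia-harness-2.1",
    eval_method="exact_match",
    seed=42,
    score=0.67,
)
print(run)
```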

In addition to regular percentages of correctly completed tasks, benchmarks sometimes use two other metrics:

  • pass@k (pronounced “pass at k”): the probability of passing at least one of k runs. In other words, whether the agent is capable of succeeding at all if you let it try many times.
  • pass^k (pronounced “pass wedge k”): the probability of passing all k runs of the same task. In other words, whether the agent succeeds every single time across k attempts. Measured empirically by running each task k or more times and estimating the chance that k randomly chosen runs all pass. A steeper decline as k grows indicates less consistent performance across runs (see the sketch below).
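
Here is a minimal sketch of how both metrics can be estimated from recorded runs, using the standard combinatorial estimators; the run counts in the example are hypothetical:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k runs passes,
    given c passing runs observed out of n total (unbiased estimator)."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws, so at least one pass is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k runs pass,
    given c passing runs observed out of n total."""
    if c < k:
        return 0.0  # fewer passes than draws, so an all-pass outcome is impossible
    return comb(c, k) / comb(n, k)

# Hypothetical task: 6 passes out of 8 recorded runs.
n, c = 8, 6
for k in (1, 2, 4, 8):
    print(f"k={k}: pass@k={pass_at_k(n, c, k):.2f}, pass^k={pass_hat_k(n, c, k):.2f}")
# pass@k rises with k while pass^k falls, which is why pass^k is the stricter
# consistency metric.
```

A benchmark-level figure is then typically the average of these per-task estimates.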

Through the lens of business automation, pass^k is the more relevant metric. Unfortunately, most benchmarks report only pass^1, not higher pass^k values. Other important metrics that are not always reported are the time required to complete a task and the cost incurred in input and output tokens. BFCL is an example of a benchmark that reports both.

Benchmarks often have problems at release and are improved over time. For example, roughly 68% of tasks in the original SWE-bench (2024) were unsolvable due to underspecified problems or unfair tests, which led to SWE-bench Verified's human validation process.

Shankar (2025) goes into more detail on benchmark interpretation.

Key benchmarks for evaluating agents for enterprise use

Benchmark        Task                                    Best pass^1
GAIA             Answer questions using tools and web    90% (SU Zero agent)
BFCL V3          Call functions correctly                77% (Claude Opus 4.5)
τ²-bench         Serve customers with policy compliance  85% (Gemini 3 Pro)
Vending-Bench 2  Run a business over many turns          $5,478 (Gemini 3 Pro)

I selected these featured benchmarks based on the criteria listed above; each one is covered in more detail in its own section. The scores shown are pass^1, except for Vending-Bench 2, which reports a dollar amount rather than a pass rate. Only τ²-bench systematically reports pass^k metrics, though not for all entries. See its section for a detailed breakdown.

Specialty benchmarks with narrower focus

  • Coding: SWE-bench Verified. Fix real GitHub issues from Python repositories. While highly relevant for evaluating coding capabilities, most enterprises will adopt existing AI coding tools (Claude Code, GitHub Copilot, Cursor) rather than develop custom coding agents. Leading score: 74.4% (Claude Opus 4.5, end of 2025).
  • Web automation: WebArena, Mind2Web. Navigate and complete tasks on real websites. Web browsing is partially covered in GAIA.
  • GUI automation: OSWorld, OfficeBench, AndroidWorld. Control Windows/Mac/Linux/Android via a graphical user interface. Only relevant if the agent must use a GUI instead of APIs. GUIs add a failure mode.
  • Safety: FORTRESS. Tests safeguard robustness vs over-refusal. Important for production deployments but not the focus of this article.

Which types of agents are ready for enterprise use?

Let's consider a business that wants to automate a task. According to the survey by Pan et al. (2025), increasing productivity is the most common motivation. The baseline for accuracy is a human worker doing the task, who also makes mistakes. Unlike standard software, the expectation shouldn't be 100% accuracy, but an acceptable trade-off for the benefits of automation. GAIA provides a human baseline of 92%, just 2 percentage points ahead of the best models.

I propose three stages of readiness:

  1. Internal tools reporting to humans, such as deep research, data analysis, information extraction, documentation, and coding agents, are ready now. The current highest scores for GAIA, BFCL, and SWE-bench at the end of 2025 are 90%, 77.5%, and 74.4%, respectively. Agents offer a profitable trade-off between accuracy and productivity as long as the time humans spend checking results is less than the time saved by automation (see the break-even sketch after this list).
  2. Customer-facing tools, such as customer service agents. The challenge here is consistency, not capability: τ-bench shows models hitting 80% pass^1 but dropping significantly at pass^8, meaning the agent might handle a request perfectly one day and fail the next. Tight monitoring and a swift escalation path to a human are necessary.
  3. Long-running autonomous work, such as inventory and portfolio management, scheduling, and management of other agents across multiple tasks, is not ready yet. Vending-Bench shows that even the best models have massive variance across runs, and there's a risk of meltdowns where they spiral into bizarre behavior.
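
As a rough illustration of the break-even condition in stage 1, here is a minimal back-of-the-envelope sketch. The simple time model and all numbers are hypothetical assumptions, not figures from any benchmark or the survey:

```python
def net_minutes_saved_per_task(
    t_human: float,    # minutes a human needs to do the task from scratch
    t_review: float,   # minutes a human needs to review the agent's output
    t_fix: float,      # minutes to redo or fix the task when the agent fails
    pass_rate: float,  # agent's probability of completing the task correctly
) -> float:
    """Expected human minutes saved per task under a simple time model."""
    expected_human_effort = t_review + (1.0 - pass_rate) * t_fix
    return t_human - expected_human_effort

# Hypothetical example: a 30-minute research task, 5 minutes of review,
# 20 minutes to fix a failure, and an agent that succeeds 85% of the time.
saved = net_minutes_saved_per_task(t_human=30, t_review=5, t_fix=20, pass_rate=0.85)
print(f"Expected minutes saved per task: {saved:.1f}")  # 22.0
```

Automation pays off when this number is positive and large enough to also cover token costs and monitoring overhead.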

References

Backlund, Axel, and Lukas Petersson. 2025. “Vending-Bench: A Benchmark for Long-Term Coherence of Autonomous Agents.” arXiv. https://doi.org/10.48550/arXiv.2502.15840.
Mialon, Grégoire, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. 2023. “GAIA: A Benchmark for General AI Assistants.” arXiv. https://doi.org/10.48550/arXiv.2311.12983.
Pan, Melissa Z., Negar Arabzadeh, Riccardo Cogo, Yuxuan Zhu, Alexander Xiong, Lakshya A. Agrawal, Huanzhi Mao, et al. 2025. “Measuring Agents in Production.” arXiv. https://doi.org/10.48550/arXiv.2512.04123.
Patil, Shishir G, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E. Gonzalez. 2025. “The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models.” In Forty-Second International Conference on Machine Learning. ICML. https://openreview.net/forum?id=2GmDdhBdDk.
Schmid, Philipp. 2025. “AI Agent Benchmark Compendium.” October 2025. https://www.philschmid.de/benchmark-compedium.
Shankar, Shrivu. 2025. “Understanding AI Benchmarks.” December 2025. https://blog.sshh.io/p/understanding-ai-benchmarks.
Yao, Shunyu, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. 2024. “τ-Bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.” arXiv. https://doi.org/10.48550/arXiv.2406.12045.