AI agent benchmarks are misleading, study warns

AI agents are emerging as a promising new research direction with potential real-world applications. These agents use foundation models such as large language models (LLMs) and vision language models (VLMs) to accept natural language instructions and pursue complex goals autonomously or semi-autonomously. AI agents can use a variety of tools such as browsers, search engines, and code compilers to verify their actions and reason about their goals.

However, a recent analysis by researchers from Princeton University has revealed a number of shortcomings in current agent benchmarking and evaluation practices that hinder their usefulness in real-world applications.

Their findings indicate that agent benchmarking comes with its own distinct challenges, and that we cannot evaluate agents in the same way we benchmark foundation models.

Cost vs. Accuracy Trade-off

One of the main issues the researchers highlight in their study is the lack of cost control in agent evaluations. AI agents can be significantly more expensive to run than a single model call, because they often rely on stochastic language models that can produce different results when given the same query multiple times.


To increase accuracy, some agentic systems generate several responses and use mechanisms such as voting or external verification tools to select the best one. Sometimes, sampling hundreds or thousands of responses can improve the agent's accuracy. While this approach can improve performance, it incurs significant computational costs. Inference costs are not always an issue in research settings, where the goal is to maximize accuracy.
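
To make the sampling-and-voting idea concrete, here is a minimal sketch of majority voting over repeated model calls. The query_llm function is a hypothetical placeholder for whatever model API is in use; the point is simply that each extra sample can raise accuracy, but every sample is another paid call.

```python
from collections import Counter

def query_llm(prompt: str) -> str:
    """Hypothetical call to a stochastic LLM; each call may return a different answer."""
    raise NotImplementedError("Replace with a real model call.")

def answer_by_majority_vote(prompt: str, num_samples: int = 5) -> str:
    """Sample the model several times and return the most common answer.

    Accuracy tends to rise with num_samples, but so does inference cost,
    since every extra sample is another model invocation.
    """
    answers = [query_llm(prompt) for _ in range(num_samples)]
    most_common_answer, _ = Counter(answers).most_common(1)[0]
    return most_common_answer
```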

In practical applications, however, the budget available for each query is limited, making it crucial to control the cost of evaluating agents. Failing to do so can encourage researchers to design extremely expensive agents simply to top the leaderboard. The Princeton researchers propose visualizing evaluation results as a Pareto curve of accuracy and inference cost, and using techniques that jointly optimize the agent for these two metrics.
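
A Pareto-style comparison can be computed in a few lines of code. The sketch below finds the agent designs that are not dominated on both accuracy and cost; the agent names and numbers are made up for illustration, not taken from the study.

```python
def pareto_frontier(agents):
    """Return the agents not dominated by any other agent.

    An agent is dominated if another agent is at least as accurate
    and at least as cheap, and strictly better on one of the two.
    """
    frontier = []
    for name, acc, cost in agents:
        dominated = any(
            other_acc >= acc and other_cost <= cost
            and (other_acc > acc or other_cost < cost)
            for _, other_acc, other_cost in agents
        )
        if not dominated:
            frontier.append((name, acc, cost))
    return frontier

# Illustrative, made-up (accuracy, cost-per-query) numbers for three agent designs.
agents = [
    ("single_call", 0.62, 0.002),
    ("self_consistency_25", 0.71, 0.05),
    ("debate_pipeline", 0.70, 0.40),  # similar accuracy, roughly 10x the cost
]
print(pareto_frontier(agents))  # only the non-dominated designs survive
```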

The researchers evaluated the accuracy and cost trade-offs of various prompting techniques and agentic patterns proposed in different papers.

“With broadly similar accuracy, the cost could differ by almost two orders of magnitude,” the researchers write. “However, the cost of running these agents is not the primary metric reported in either paper.”

The researchers say that jointly optimizing the two metrics can lead to "agents that cost less while maintaining accuracy." Joint optimization also allows researchers and developers to trade off the fixed and variable costs of running an agent. For example, they can spend more on optimizing the agent's design (a fixed cost) while reducing the variable cost by using fewer in-context learning examples in the agent's prompt.
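
One simple way to frame such a joint optimization is to scalarize the two metrics, scoring each candidate design by accuracy minus a weighted cost term. The sketch below applies this to a hypothetical design choice, the number of in-context examples in the prompt; the accuracy and cost figures are placeholders, not results from the paper.

```python
def joint_score(accuracy: float, cost: float, cost_weight: float) -> float:
    """Scalarize the two metrics: reward accuracy, penalize expected cost per query."""
    return accuracy - cost_weight * cost

# Hypothetical measurements for agent variants that differ only in the
# number of in-context examples included in the prompt (a variable cost).
variants = {
    0: (0.58, 0.004),   # num examples: (accuracy, $ per query)
    4: (0.66, 0.009),
    16: (0.68, 0.025),
}

cost_weight = 5.0  # how much one dollar per query is "worth" in accuracy terms
best = max(variants, key=lambda k: joint_score(*variants[k], cost_weight))
print(f"Best variant under this budget weighting: {best} in-context examples")
```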

The researchers tested joint optimization on HotpotQA, a popular question-answering benchmark. Their results show that the joint optimization formulation provides a way to strike an optimal balance between accuracy and inference cost.

“Useful agent evaluations must take costs into account—even if we are ultimately not interested in costs but only in identifying innovative agent designs,” the researchers write. “Accuracy alone cannot identify progress, because it can be improved by scientifically meaningless methods such as retries.”

Model development vs. downstream applications

Another problem the researchers highlight is the difference between evaluating models for research purposes and developing downstream applications. In research, accuracy is often the primary focus and inference costs are largely ignored. However, when developing real-world applications on top of AI agents, inference costs play a key role in deciding which model and technique to use.

Assessing the cost of inference for AI agents is difficult. For example, different providers may charge different amounts for the same model. Meanwhile, the cost of API calls changes regularly and can vary with developers' decisions. For example, some platforms charge different prices for bulk API calls.

To address this problem, the researchers created a website that adjusts model comparisons based on token prices.
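
The underlying adjustment is straightforward: convert each model's token usage into dollars using the provider's per-token prices. The sketch below shows that calculation with placeholder prices; real rates vary by provider and change over time.

```python
def query_cost(input_tokens: int, output_tokens: int,
               price_per_million_input: float, price_per_million_output: float) -> float:
    """Dollar cost of a single call, given per-million-token prices."""
    return (input_tokens * price_per_million_input
            + output_tokens * price_per_million_output) / 1_000_000

# Placeholder prices per million tokens (check your provider's current pricing).
providers = {
    "provider_a": (5.00, 15.00),
    "provider_b": (0.50, 1.50),
}

for name, (p_in, p_out) in providers.items():
    # Same workload for both: 2,000 input tokens and 500 output tokens per query.
    print(name, round(query_cost(2_000, 500, p_in, p_out), 4))
```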

They also conducted a case study on NovelQA, a benchmark for question answering over very long texts. They found that benchmarks designed for model evaluation can be misleading when used for downstream evaluation. For example, the original NovelQA study makes retrieval-augmented generation (RAG) look significantly worse relative to long-context models than it would be in a real-world scenario. Their findings show that RAG and long-context models were roughly equally accurate, while the long-context models were 20 times more expensive.

Overfitting is a problem

When learning new tasks, machine learning (ML) models often find shortcuts that allow them to perform well on benchmarks. One prominent type of shortcut is "overfitting," where the model finds ways to cheat on the benchmark and deliver results that don't translate to the real world. The researchers found that overfitting is a serious problem for agent benchmarks because they tend to be small, typically consisting of only a few hundred samples. This problem is more severe than data contamination in training foundation models, because knowledge of the test samples can be programmed directly into the agent.

To address this issue, the researchers suggest that benchmark developers create and maintain holdout test sets composed of examples that can't be memorized during training and can only be solved through a proper understanding of the target task. In their analysis of 17 benchmarks, the researchers found that many lacked proper holdout datasets, allowing agents to take shortcuts, even unintentionally.
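
For benchmark maintainers, a held-out split can be as simple as reserving a random subset of tasks and keeping it private, for example behind an evaluation server. The sketch below illustrates that idea with a hypothetical task list; it is not code from the study.

```python
import random

def split_benchmark(samples, holdout_fraction=0.3, seed=0):
    """Split benchmark samples into a public development set and a private holdout set.

    The holdout set would be kept secret (or scored only via an evaluation
    server) so agent developers cannot hard-code knowledge of its tasks.
    """
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cutoff], shuffled[cutoff:]

# Hypothetical list of benchmark tasks.
tasks = [{"id": i, "question": f"task {i}"} for i in range(300)]
public_set, holdout_set = split_benchmark(tasks)
print(len(public_set), len(holdout_set))  # 210 90
```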

“To our surprise, we found that many agent benchmarks do not include test sets,” the researchers write. “In addition to creating a test set, benchmark creators should consider keeping it secret to prevent LLM contamination or agent overfitting.”

They also noted that different kinds of holdout samples are needed, depending on the desired level of generality of the task the agent performs.

“Benchmark developers must do everything in their power to ensure that shortcuts are impossible,” the researchers write. “We believe that this is the responsibility of benchmark developers, not agent developers, because designing benchmarks that do not allow shortcuts is much easier than checking each agent to see if it takes shortcuts.”

The researchers ran tests on WebArena, a benchmark that evaluates the performance of AI agents in solving problems on different websites. They found several shortcuts in the training datasets that allowed the agents to overfit to tasks in ways that would easily break with small changes in the real world. For example, an agent might assume a particular structure for web addresses without accounting for the possibility that it could change in the future, or that it would not generalize across different websites.
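
The sketch below is a simplified illustration of this kind of shortcut, not code from WebArena: hard-coding a URL template scores well as long as every benchmark site follows it, while discovering links from the page itself is slower but far less brittle.

```python
# Brittle shortcut: assume every product page follows one URL template.
def product_url_shortcut(base_url: str, product_id: str) -> str:
    # Scores well if the benchmark sites all use this exact structure,
    # but breaks the moment a site changes or uses a different scheme.
    return f"{base_url}/product/{product_id}"

# More robust behavior: discover the link from the page instead of assuming it.
def find_product_link(page_links: dict[str, str], product_name: str) -> str | None:
    """Look up the product's link from the page's anchor-text -> href mapping."""
    for text, href in page_links.items():
        if product_name.lower() in text.lower():
            return href
    return None  # fall back to site search, pagination, etc.
```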

The researchers warn that these errors inflate accuracy estimates and lead to over-optimism about agents' capabilities.

Because AI agents are a new field, the research and development communities still have much to learn about how to test the limits of these new systems, which could soon become an important part of everyday applications.

“Comparing AI agents is new, and best practices have not yet been established, making it difficult to distinguish real progress from noise,” the researchers write. “Our thesis is that agents are sufficiently different from models that benchmarking practices need to be rethought.”
