New AI benchmarking tools evaluate real-world performance
Wednesday, June 25, 2025, 06:48, by InfoWorld
A new AI benchmark for enterprise applications is now available following the launch of xbench, a testing initiative developed in-house by Chinese venture capital firm HongShan Capital Group (HSG).
The challenge with many current benchmarks is that they are widely published, making it possible for model creators to train their models to perform well on them, which reduces their usefulness as a true measure of performance. HSG says it has created a suite of ever-changing benchmarking tests, making it harder for AI companies to train on the test and forcing them to rely on more general test-taking capabilities.

HSG said its original intention in creating xbench was to turn its internal evaluation tool into “a public AI benchmark test, and to attract more AI talents and projects in an open and transparent way. We believe that the spirit of open source can make xbench evolve better and create greater value for the AI community.” On June 17, the company announced it had officially open-sourced two xbench benchmarks, xbench-Science QA and xbench-DeepSearch, promising that “in the future, we will continuously and dynamically update the benchmarks based on the development of large models and AI Agents ….”

Real-world relevance

AI models, said Mohit Agrawal, research director of AI and IoT at CounterPoint Research, “have outgrown traditional benchmarks, especially in subjective domains like reasoning. Xbench is a timely attempt to bridge that gap with real-world relevance and adaptability. It’s not perfect, but it could lay the groundwork for how we track practical AI impact going forward.”

In addition, he said, the models themselves “have progressed significantly over the last two to three years, and this means that the evaluation criteria need to evolve with their changing capabilities. Xbench aims to fill key gaps left by traditional evaluation methods, which is a welcome first step toward a more relevant and modern benchmark. It attempts to bring real-world relevance while remaining dynamic and adaptable.”

However, said Agrawal, while it’s relatively easy to evaluate models on math or coding tasks, “assessing models in subjective areas such as reasoning is much more challenging. Reasoning models can be applied across a wide variety of contexts, and models may specialize in particular domains. In such cases, the necessary subjectivity is difficult to capture with any benchmark. Moreover, this approach requires frequent updates and expert input, which may be difficult to maintain and scale.”

Biases, he added, “may also creep into the evaluation, depending on the domain and geographic background of the experts. Overall, xbench is a strong first step, and over time, it may become the foundation for evaluating the practical impact and market readiness of AI agents.”

Hyoun Park, CEO and chief analyst at Amalgam Insights, has some concerns. “The effort to keep AI benchmarks up-to-date and to improve them over time is a welcome one, because dynamic benchmarks are necessary in a market where models are changing on a monthly or even weekly basis,” he said. “But my caveat is that AI benchmarks need to both be updated over time and actually change over time.”

Benchmarking new use cases

He pointed out, “we are seeing with efforts such as Databricks’ Agent Bricks that [it] is important to build independent benchmarks for new and emerging use cases. And Salesforce Research recently released a paper showing how LLMs fare poorly in conducting some practical tasks, even when they are capable of conducting the technical capabilities associated with the task.”

The value of an LLM, said Park, is “often not in the ability to solve any specific problem, but to identify when a novel or difficult approach might be necessary. And that is going to be a challenge for even this approach to benchmarking models, as the current focus is on finding more complex questions that can be directly solved through LLMs rather than figuring out whether these complex tasks are necessary, based on more open-ended and generalized questioning.”

Further to that, he suggested, “[it is] probably more important for 99% of users to simply be aware that they need to conceptually be aware of Vapnik-Chervonenkis complexity [a measure of the complexity of a model] to understand the robustness of a challenge that an AI model is trying to solve. And from a value perspective, it is more useful to simply provide context on whether the VC dimension of a challenge might be considered low or high, because there are practical ramifications on whether you use the small or large AI model to solve the problem, which can be orders of magnitude differences in cost.”

Model benchmarking, Park said, “has been quite challenging, as the exercise is both extremely high stakes in the multi-billion-dollar AI wars, and also poorly defined. There is a panoply of incentives for AI companies to cheat and overfit their models to specific tests and benchmarks.”
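That overfitting concern is exactly what a rotating, dynamically updated benchmark is meant to blunt: if each published round draws on a fresh held-out question set, a model tuned to last round’s questions gains little. The sketch below illustrates that general idea only; the question pool, the round tags, and the functions are hypothetical stand-ins, not xbench’s actual harness or API.

```python
import hashlib
import random

# Hypothetical held-out question pool; a real harness would keep a much
# larger, private pool and refresh its contents between rounds.
QUESTION_POOL = [
    {"id": "q1", "prompt": "What is the chemical symbol for gold?", "answer": "Au"},
    {"id": "q2", "prompt": "What is 17 * 23?", "answer": "391"},
    {"id": "q3", "prompt": "Which planet has the shortest day?", "answer": "Jupiter"},
    {"id": "q4", "prompt": "What does 'HTTP' stand for?", "answer": "Hypertext Transfer Protocol"},
]

def sample_round(pool, round_tag, k=2):
    """Draw a reproducible but round-specific subset of questions.

    Seeding on the round tag means each evaluation round uses a different
    slice of the pool, so training on a past round's questions helps less.
    """
    seed = int(hashlib.sha256(round_tag.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return rng.sample(pool, k)

def run_round(model_fn, round_tag):
    """Score a model callable (prompt -> answer string) on one round."""
    questions = sample_round(QUESTION_POOL, round_tag)
    correct = sum(
        int(model_fn(q["prompt"]).strip().lower() == q["answer"].lower())
        for q in questions
    )
    return correct / len(questions)

if __name__ == "__main__":
    # Stand-in model that only "knows" one answer, for demonstration.
    def dummy_model(prompt):
        return "Au" if "gold" in prompt else "unknown"

    print("2025-06 round score:", run_round(dummy_model, "2025-06"))
    print("2025-07 round score:", run_round(dummy_model, "2025-07"))
```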
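Park’s point about low- versus high-complexity challenges also has a practical cost angle: a rough complexity estimate is enough to decide whether a cheap small model or an expensive large one should handle a task. The following sketch is a crude heuristic router under assumed names and prices; it is not a real Vapnik-Chervonenkis dimension calculation, and the models and per-token costs are illustrative only.

```python
# Hypothetical cost-aware router: pick a small or large model based on a
# rough task-complexity estimate (a stand-in heuristic, not a true
# VC-dimension computation).

# Assumed per-1K-token prices, for illustration only.
MODEL_COSTS = {"small-model": 0.0002, "large-model": 0.02}

COMPLEX_HINTS = ("prove", "multi-step", "plan", "debug", "derive", "optimize")

def estimate_complexity(task: str) -> str:
    """Crude proxy: long prompts or reasoning-heavy keywords => 'high'."""
    lowered = task.lower()
    if len(task.split()) > 200 or any(hint in lowered for hint in COMPLEX_HINTS):
        return "high"
    return "low"

def route(task: str) -> str:
    """Return the model name to use for this task."""
    return "large-model" if estimate_complexity(task) == "high" else "small-model"

if __name__ == "__main__":
    tasks = [
        "Translate 'good morning' into French.",
        "Derive and prove a closed form for this recurrence, step by step.",
    ]
    for task in tasks:
        model = route(task)
        print(f"{model} (~${MODEL_COSTS[model]}/1K tokens): {task}")
```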
https://www.infoworld.com/article/4012200/unique-ai-benchmarking-tools-evaluate-real-world-performan...