What misleading Meta Llama 4 benchmark scores show enterprise leaders about evaluating AI performance claims
Wednesday, April 9, 2025, 03:49, by InfoWorld
Benchmarks are critical when evaluating AI — they reveal how well models work, as well as their strengths and weaknesses, based on factors like reliability, accuracy, and versatility.
But the revelation that Meta misled users about the performance of its new Llama 4 model has raised red flags about the accuracy and relevance of benchmarking, particularly when model builders tweak their products to get better results. “Organizations need to perform due diligence and evaluate these claims for themselves, because operating environments, data, and even differences in prompts can change the outcome of how these models perform,” said Dave Schubmehl, research VP for AI and automation at IDC.

Vendors may fudge results, but it’s not likely to dissuade IT buyers

On Saturday, Meta unexpectedly dropped two new Llama models, Scout and Maverick, claiming that Maverick outperformed GPT-4o and Gemini 2.0 Flash and achieved comparable results to the new DeepSeek v3 on reasoning and coding. The model quickly claimed second place behind Gemini 2.5 Pro on LMArena, an AI proving ground where human raters compare model outputs. The company also claimed that Scout delivered better results than Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across a broad range of benchmarks.

However, independent researchers soon discovered, by reading the fine print on the Llama website, that Meta had used an experimental chat version of Maverick in testing, one specifically optimized for conversationality, as opposed to the publicly available version it dropped over the weekend. Meta has denied any wrongdoing.

It’s not the first time vendors have been misleading about benchmarking, and experts say a little fudging isn’t likely to prompt enterprise buyers to look elsewhere. “Every vendor will try to use benchmarked results as a demonstration of superior performance,” said Hyoun Park, CEO and chief analyst at Amalgam Insights. “There is always some doubt placed on vendors that intentionally game the system from a benchmarking perspective, especially when they are opaque in their methods.”

However, as long as leading AI vendors show that they are keeping pace with their competitors, or can potentially do so, there will likely be little to no long-term backlash, he said. He pointed out that the foundation model landscape is changing “extremely rapidly,” with massive improvements in either performance or productivity arriving monthly, or even more frequently. “Frankly, none of today’s model benchmark leaderboards will be relevant in six to 12 months,” said Park.

Enterprises: Do your due diligence with AI

With the proliferation of models in the market, it’s naturally important for organizations and developers to have some idea of how AI will work in their environment, and benchmarks partially serve this need, Schubmehl pointed out. “Benchmarks provide a starting point, especially since performance is becoming increasingly important as applications using AI models become more complex,” he said. However, “evaluation with the organizations’ data, prompts, and operating environments is the ultimate benchmark for most enterprises.”

Park emphasized that benchmarks are ultimately only as useful as the accuracy of their simulated environments. For a defined transactional technology such as a server or database, for instance, metrics and guardrails can often simulate specific high-traffic or compute-heavy environments fairly accurately. However, the goals of AI are often outcome-based rather than tied to discrete tasks or rules-based workflows. For instance, the ability to answer a customer service question is different from solving a customer service request, Park noted. AI may be very good at the former task but struggle with the intricate chain of thought (CoT), across many permutations, needed to resolve the latter.

Therefore, when evaluating models, enterprise buyers should first consider whether benchmarked tasks and results match their business processes and end results, or whether the benchmarking stops at an intermediate point. They must conceptually understand the processes and work being supported or automated, and align benchmark results to their current work process.

It is also important to ensure that the benchmark environment is similar to the business production environment, he said, and to document areas where the network, compute, storage, inputs, outputs, and contextual augmentation of the benchmark environment differ from the production environment. Further, make sure that the model tested matches the model that is available for preview or for production, Park advised. It is common for models to be optimized for a benchmark without revealing much detail about the cost or time required for the training, augmentation, or tuning behind that optimization.

Ultimately, “businesses seeking to conduct a competitive evaluation of AI models can use benchmarks as a starting point, but really need to scenario test in their own corporate or cloud environments if they want an accurate understanding of how a model may work for them,” Park emphasized.
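That kind of scenario testing need not start with a heavyweight framework. The sketch below is a minimal, illustrative harness, not something from the article or from any of the vendors named in it: it replays an organization’s own prompts and expected answers against an OpenAI-compatible chat endpoint and reports a simple pass rate. The endpoint URL, model name, test cases, and substring scoring rule are all assumptions to be replaced with your own data and criteria.

```python
# eval_harness.py -- minimal in-house evaluation sketch (illustrative only).
# Assumptions: an OpenAI-compatible /v1/chat/completions endpoint, an API key
# in the MODEL_API_KEY environment variable, and a naive substring check as
# the "pass" criterion. Swap in your real prompts, data, and scoring logic.

import os
import requests

API_URL = "https://example.internal/v1/chat/completions"  # hypothetical endpoint
MODEL = "candidate-model"                                  # hypothetical model name

# Replace with prompts and expected answers drawn from your own workflows.
TEST_CASES = [
    {"prompt": "Summarize our refund policy in one sentence.", "expected": "30 days"},
    {"prompt": "Which region hosts the EU customer database?", "expected": "eu-west-1"},
]

def ask(prompt: str) -> str:
    """Send one prompt to the candidate model and return the text of its reply."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {os.environ['MODEL_API_KEY']}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def main() -> None:
    passed = 0
    for case in TEST_CASES:
        answer = ask(case["prompt"])
        ok = case["expected"].lower() in answer.lower()  # naive scoring rule
        passed += ok
        print(f"{'PASS' if ok else 'FAIL'}: {case['prompt']}")
    print(f"{passed}/{len(TEST_CASES)} cases passed")

if __name__ == "__main__":
    main()
```

Even a harness this small measures what Park and Schubmehl describe as the ultimate benchmark: the model’s behavior on your own data, prompts, and environment rather than on a vendor’s leaderboard configuration.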
https://www.infoworld.com/article/3957715/what-misleading-meta-llama-4-benchmark-scores-show-enterpr...