
Vector Institute aims to clear up confusion about AI model performance

Friday, April 11, 2025, 02:41, by InfoWorld
AI models are advancing at a dizzying clip, with builders boasting ever more impressive performance with each iteration.

But how do the various models really stack up? And how can IT buyers truly know that vendors are being forthcoming with their results?

The Geoffrey Hinton-founded Vector Institute for Artificial Intelligence hopes to bring more clarity with its new State of Evaluations study, which includes an interactive leaderboard. The independent, non-profit AI research institute tested 11 top open and closed source models against 16 benchmarks in math, general knowledge, coding, safety, and other areas, fully open-sourcing its results.

“Researchers, developers, regulators, and end-users can independently verify results, compare model performance, and build out their own benchmarks and evaluations to drive improvements and accountability,” said John Willes, Vector Institute’s AI infrastructure and research engineering manager. 

How did the top models do?

Vector Institute analyzed a range of state-of-the-art models:

Open source:

Qwen2.5-72B-Instruct (Alibaba)

Llama-3.1-70B-Instruct (Meta)

Command R+ (Cohere)

Mistral-Large-Instruct-2407 (Mistral)

DeepSeek-R1 (DeepSeek)

Closed source:

OpenAI GPT-4o

OpenAI o1

OpenAI GPT-4o-mini

Google Gemini-1.5-Pro

Google Gemini-1.5-Flash

Anthropic Claude-3.5-Sonnet

Models were ranked on two types of benchmarks: basic, comprising short question-answer tasks, and agentic, requiring sequential decisions and tool use to solve multi-step challenges. They were tested on language understanding, math, code generation, general AI assistance, AI harmfulness, common-sense reasoning, software engineering, graduate-level intelligence, and other tasks.
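
The distinction matters in practice. As a rough sketch (the harness, function names, and data formats below are illustrative assumptions, not Vector's actual code), a basic benchmark scores a single short answer, while an agentic one loops through model decisions and tool calls:

```python
# Illustrative sketch of the two benchmark styles; model_answer and
# model_step stand in for real model API calls, and the data formats
# are assumptions, not Vector's actual harness.

def score_basic(model_answer, questions):
    """Basic benchmark: one prompt in, one short answer out, exact-match scored."""
    correct = sum(
        1 for q in questions
        if model_answer(q["prompt"]).strip() == q["answer"]
    )
    return correct / len(questions)

def run_agentic(model_step, task, tools, max_steps=10):
    """Agentic benchmark: the model plans, calls tools, and observes results
    over several steps before a task-specific check scores the outcome."""
    history = [("goal", task["goal"])]
    for _ in range(max_steps):
        action = model_step(history)                    # model picks the next action
        if action["type"] == "finish":
            return task["check"](action["result"])      # did it solve the task?
        observation = tools[action["tool"]](action["args"])  # execute the tool call
        history.append((action, observation))           # feed the result back in
    return False  # ran out of steps without finishing
```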

Model performance ranged widely, but DeepSeek and o1 consistently scored highest. Command R+, on the other hand, exhibited the lowest performance, but Willes pointed out that it is the smallest and oldest of the models tested.

Overall, closed source models tended to outperform open source models, particularly on the most challenging knowledge and reasoning tasks. That said, DeepSeek-R1's performance shows that open source models can remain competitive.

“In simple cases, these models are quite capable,” said Willes. “But as these tasks get more complicated, we see a large cliff in terms of reasoning capability and understanding.”

One such task could be, for instance, a customer support function requiring a number of steps. “For complex tasks, there’s still engineering work to be done,” said Willes. “We’re a long way from really general purpose models.”

All 11 models also struggled with agentic benchmarks designed to assess real-world problem-solving abilities around general knowledge, safety, and coding. Claude 3.5 Sonnet and o1 ranked highest in this area, particularly on more structured tasks with explicit objectives. Still, all models had a hard time with software engineering and other tasks requiring open-ended reasoning and planning.

Multimodality is becoming increasingly important for AI systems, as it allows models to process different types of input. To measure it, Vector used the Multimodal Massive Multitask Understanding (MMMU) benchmark, which evaluates a model's ability to reason about images and text across both multiple-choice and open-ended formats. Questions cover math, finance, music, and history, and are designated as “easy,” “medium,” or “hard.”
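
To illustrate what such a benchmark item looks like, here is a hypothetical record shape for an MMMU-style question; the field names are assumptions for this sketch, not Vector's actual schema:

```python
# Illustrative record shape for an MMMU-style multimodal question; the
# field names are assumptions for this sketch, not Vector's actual schema.
from dataclasses import dataclass, field

@dataclass
class MultimodalQuestion:
    subject: str        # e.g. "math", "finance", "music", "history"
    difficulty: str     # "easy", "medium", or "hard"
    prompt: str         # question text, which may reference the image
    image_path: str     # the accompanying image the model must reason about
    choices: list[str] = field(default_factory=list)  # empty for open-ended items
    answer: str = ""    # gold answer, or the correct choice label

def is_multiple_choice(q: MultimodalQuestion) -> bool:
    # Open-ended items have no fixed options and need free-form grading.
    return bool(q.choices)
```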

In its evaluation, Vector found that o1 exhibited “superior” multimodal understanding across formats and difficulty levels. Claude 3.5 Sonnet also did well, but not at o1’s level. Here again, researchers found that most models dropped in performance when given more challenging, open-ended tasks.

“There’s a lot of work going on right now that’s exploring how to make these systems really multimodal, so they can take text input, image input, audio input, and unify their capabilities,” said Willes. “The takeaway here is we’re not quite there yet.”

Overcoming challenges of benchmarking

Willes pointed out that one of the big problems with benchmarking is evaluation leakage, where models learn to perform well on specific evaluation datasets they’ve seen before, but not on new, unseen data.

“Once these benchmarks are out in the public domain, it’s awesome because others can replicate and validate,” he said. However, “there’s a huge challenge in making sure that when a model improves its performance in the benchmark, we’re sure it’s because we’ve had a step change in the model’s capability, not just that it’s seen the answers to the test.”
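
One common guard against this kind of leakage, sketched below as a generic decontamination heuristic rather than Vector's stated method, is to flag benchmark items whose text overlaps heavily with a model's training corpus:

```python
# A minimal sketch of one common leakage check: flag benchmark items whose
# 8-gram overlap with a training corpus is high. This is a generic
# decontamination heuristic, not Vector's stated methodology.

def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def flag_contaminated(benchmark_items, training_ngrams, threshold=0.5):
    """Return items where at least `threshold` of their 8-grams appear in training data."""
    flagged = []
    for item in benchmark_items:
        grams = ngrams(item["prompt"] + " " + item["answer"])
        if grams and len(grams & training_ngrams) / len(grams) >= threshold:
            flagged.append(item)
    return flagged
```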

To help IT buyers make sense of its findings and apply the best models to their specific use cases, Vector has released all of its sample-level results.

“Most of the time, when people report these metrics, they give you a high-level metric,” said Willes. But on Vector’s interactive leaderboard, users can click through and analyze every single question asked of the model, and the ensuing output.

So, if enterprise users have a particular use case they want to dig into, they can go deep into the results to gain that understanding. It is important to have a strong connection to real-world use cases, Willes pointed out, so that IT decision-makers can map one-to-one between the models being evaluated and what they are building.
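
As a hypothetical illustration of that kind of drill-down (the file name and JSON fields here are assumptions, not Vector's published format), sample-level records can be filtered and re-aggregated for a single use case:

```python
# Hypothetical sketch of drilling into sample-level results like those Vector
# publishes; the file name and JSON fields are assumptions for illustration.
import json
from collections import defaultdict

with open("sample_level_results.json") as f:   # assumed export, one record
    samples = json.load(f)                     # per (model, benchmark, question)

# Aggregate per-model accuracy on the subset of questions matching a use case.
tallies = defaultdict(lambda: [0, 0])          # model -> [correct, total]
for s in samples:
    if s["category"] == "coding":              # filter to the use case of interest
        tallies[s["model"]][0] += int(s["is_correct"])
        tallies[s["model"]][1] += 1

for model, (correct, total) in sorted(tallies.items()):
    print(f"{model}: {correct}/{total} = {correct / total:.1%}")
```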

“That’s one of the things that we’re trying to solve here, is to make the methodology as open as possible,” he said.

To overcome some of the most common benchmarking challenges, Vector advocates more novel benchmarks and dynamic evaluation, he explained, such as judging models against each other and against a continuously evolving scale.

“[Dynamic evaluations] have a lot more longevity and avoid a lot of the evaluation leakage issues,” said Willes. Ultimately, he said, “there’s a need for continued development in benchmarking and evaluation.”
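
One way to implement such a continuously evolving scale, shown here as a generic sketch rather than Vector's approach, is Elo-style rating of head-to-head model comparisons:

```python
# A minimal sketch of one dynamic-evaluation idea: score head-to-head model
# comparisons with Elo-style ratings, so the scale evolves with every new
# matchup instead of relying on a fixed answer key that can leak.

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return updated ratings for models A and B after a judged comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))  # A's win probability
    delta = k * ((1.0 if a_won else 0.0) - expected_a)
    return r_a + delta, r_b - delta

# Example: both models start at 1500; model A wins one judged comparison.
ratings = {"model_a": 1500.0, "model_b": 1500.0}
ratings["model_a"], ratings["model_b"] = elo_update(
    ratings["model_a"], ratings["model_b"], a_won=True
)
print(ratings)  # model_a gains 16 points, model_b loses 16
```
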
https://www.infoworld.com/article/3959786/vector-institute-aims-to-clear-up-confusion-about-model-ai...

