OpenAI Unleashes FrontierScience for AI-Fueled Scientific Reasoning
Wednesday, December 17, 2025, 11:45, by eWeek
Einstein a go-go. OpenAI is in the news again with something to help the scientific world.
The perennially busy firm has unveiled FrontierScience, an evaluation system that pushes artificial intelligence into uncharted territory by tackling the same complex scientific problems that typically occupy PhD researchers for weeks. While previous benchmarks focused on recall, FrontierScience is billed as testing genuine scientific reasoning across physics, chemistry, and biology. The benchmark launched on Dec. 16 with over 700 carefully crafted questions designed by some of the world's brightest scientific minds. OpenAI's GPT-5.2 model achieved a 77% success rate on olympiad-level problems, tasks that challenge the most gifted young scientists globally.

The performance gap

GPT-5.2 dominated structured olympiad-style questions with its 77% score, but its performance dropped sharply on open-ended research tasks, where it managed only 25% success. This 52-point gap reveals what scientists are calling the "ambiguity barrier."

The benchmark's creators assembled an unprecedented team, 42 international olympiad medalists holding 109 medals between them, alongside 45 PhD scientists, to craft questions that would truly test AI reasoning. Some research questions are so complex that human experts estimate they would require several days of computer simulations or weeks of mathematical work to solve properly. Consider this: when asked about "meso-nitrogen atoms in nickel(II) phthalocyanine," researchers noted that running the computer simulations alone "could take several days." Another question, asking for a derivation of "electrostatic wave modes" in plasma, prompted one expert to admit: "I did a similar analysis earlier this year for a different kind of wave… I think it took about three weeks to do the maths correctly."

New age of scientific discovery

The implications stretch far beyond impressive test scores: this benchmark signals we're approaching a critical tipping point where AI transforms from sophisticated search engine to genuine research collaborator.
When models eventually reach near-perfect scores on the research track, they'll function as "very good collaborators" that can multiply the progress PhD students and scientists can make.

The benchmark's evaluation system represents a fundamental shift in AI assessment. Using 10-point rubrics graded by GPT-5 to assess reasoning quality, not just final answers, it moves beyond the traditional "pass the test" mentality toward "can it do the job" evaluation. The progression is striking: when similar PhD-level science benchmarks launched in November 2023, GPT-4 scored just 39%, but GPT-5.2 now hits 92% on those same questions. This rapid advancement suggests we're witnessing the emergence of AI systems that can genuinely contribute to scientific breakthroughs.

The race to scientific AI supremacy

This benchmark launch coincides with an unprecedented surge in AI research investment that's reshaping the entire scientific landscape. The competition extends far beyond OpenAI: Google DeepMind's AlphaFold has already predicted over 200 million protein structures, work that would have taken hundreds of millions of years to complete experimentally. When AI models eventually close the 52-point gap between olympiad and research scores, they'll handle ambiguous problems as easily as constrained ones. The speed of improvement suggests this limitation won't persist long.
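OpenAI has not published the grader itself, so as a rough illustration only, here is a minimal sketch of how rubric-based, model-graded scoring can work. The rubric items, point values, and the keyword-matching stand-in "judge" below are all assumptions for demonstration, not OpenAI's actual implementation (which uses GPT-5 as the judge):

```python
# Hypothetical sketch of 10-point rubric grading in the spirit of
# FrontierScience's model-graded evaluation. Rubric text, weights,
# and the toy judge are illustrative assumptions, not OpenAI's code.

RUBRIC = [
    ("states the governing equations", 3),
    ("justifies each approximation", 3),
    ("reaches the correct final result", 4),
]  # point values sum to 10

def grade(answer: str, judge) -> int:
    """Score an answer 0-10 by asking the judge about each rubric item."""
    score = 0
    for criterion, points in RUBRIC:
        if judge(answer, criterion):  # in a real system, an LLM call
            score += points
    return score

def keyword_judge(answer: str, criterion: str) -> bool:
    """Toy deterministic stand-in: checks whether the answer mentions
    the criterion's last word. A real grader would prompt a strong
    model with the criterion and parse its yes/no verdict."""
    key = criterion.split()[-1]  # e.g. "equations", "result"
    return key in answer.lower()

print(grade("We derive the wave equations and the final result.", keyword_judge))
# prints 7: the toy answer satisfies the 3-point and 4-point items
```

The point of grading against a rubric rather than a single expected answer is that partial reasoning earns partial credit, which is what lets the benchmark assess "reasoning quality, not just final answers."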
https://www.eweek.com/news/openai-frontierscience-launch/








