OpenAI admits AI hallucinations are mathematically inevitable, not just engineering flaws

Thursday, September 18, 2025, 14:37, by ComputerWorld

OpenAI, the creator of ChatGPT, acknowledged in its own research that large language models will always produce hallucinations due to fundamental mathematical constraints that cannot be solved through better engineering, marking a significant admission from one of the AI industry’s leading companies.

The study, published on September 4 and led by OpenAI researchers Adam Tauman Kalai, Edwin Zhang, and Ofir Nachum alongside Georgia Tech’s Santosh S. Vempala, provided a comprehensive mathematical framework explaining why AI systems must generate plausible but false information even when trained on perfect data.

“Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty,” the researchers wrote in the paper. “Such ‘hallucinations’ persist even in state-of-the-art systems and undermine trust.”

The admission carried particular weight given OpenAI’s position as the creator of ChatGPT, which sparked the current AI boom and convinced millions of users and enterprises to adopt generative AI technology.

OpenAI’s own models failed basic tests

The researchers demonstrated that hallucinations stemmed from statistical properties of language model training rather than implementation flaws. The study established that “the generative error rate is at least twice the IIV misclassification rate,” where IIV referred to “Is-It-Valid” and demonstrated mathematical lower bounds that prove AI systems will always make a certain percentage of mistakes, no matter how much the technology improves.
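
The quoted bound can be written compactly as shown below. The symbols are chosen here for illustration and are an assumption, and the paper's formal statement may include additional correction terms that this article does not quote.

```latex
\[
  \mathrm{err}_{\mathrm{gen}} \;\ge\; 2 \cdot \mathrm{err}_{\mathrm{IIV}}
\]
```

Read this way, a model that misclassifies 5 percent of "Is-It-Valid" cases would have a generative error rate of at least 10 percent.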

The researchers demonstrated their findings using state-of-the-art models, including those from OpenAI’s competitors. When asked “How many Ds are in DEEPSEEK?” the DeepSeek-V3 model with 600 billion parameters “returned ‘2’ or ‘3’ in ten independent trials” while Meta AI and Claude 3.7 Sonnet performed similarly, “including answers as large as ‘6’ and ‘7.’”
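
For reference, the letter-counting question has a single deterministic answer. A minimal Python sketch follows; the helper name `count_letter` is hypothetical and used only for illustration.

```python
def count_letter(word: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.upper().count(letter.upper())


if __name__ == "__main__":
    # D-E-E-P-S-E-E-K contains exactly one "D".
    print(count_letter("DEEPSEEK", "D"))  # prints 1
```

The correct answer is 1, so every figure quoted from the trials above ("2", "3", "6", "7") is wrong.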

OpenAI also acknowledged the persistence of the problem in its own systems. The company stated in the paper that “ChatGPT also hallucinates. GPT‑5 has significantly fewer hallucinations, especially when reasoning, but they still occur. Hallucinations remain a fundamental challenge for all large language models.”

OpenAI’s own advanced reasoning models actually hallucinated more frequently than simpler systems. The company’s o1 reasoning model “hallucinated 16 percent of the time” when summarizing public information, while newer models o3 and o4-mini “hallucinated 33 percent and 48 percent of the time, respectively.”

“Unlike human intelligence, it lacks the humility to acknowledge uncertainty,” said Neil Shah, VP for research and partner at Counterpoint Technologies. “When unsure, it doesn’t defer to deeper research or human oversight; instead, it often presents estimates as facts.”

The OpenAI research identified three mathematical factors that made hallucinations inevitable: epistemic uncertainty when information appeared rarely in training data, model limitations where tasks exceeded current architectures’ representational capacity, and computational intractability where even superintelligent systems could not solve cryptographically hard problems.

Industry evaluation methods made the problem worse

Beyond proving hallucinations were inevitable, the OpenAI research revealed that industry evaluation methods actively encouraged the problem. Analysis of popular benchmarks, including GPQA, MMLU-Pro, and SWE-bench, found nine out of 10 major evaluations used binary grading that penalized “I don’t know” responses while rewarding incorrect but confident answers.

“We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty,” the researchers wrote.
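
The incentive the researchers describe can be illustrated with a small expected-score calculation. This is a sketch under a simplifying assumption: a benchmark awards 1 point for a correct answer and 0 points for both a wrong answer and an "I don't know" response. The function name and the probabilities are hypothetical.

```python
def expected_binary_score(p_correct: float, abstain: bool) -> float:
    """Expected score under 0/1 grading: 1 if correct, 0 if wrong or abstaining."""
    if abstain:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0


if __name__ == "__main__":
    for p in (0.1, 0.5, 0.9):
        guess = expected_binary_score(p, abstain=False)
        idk = expected_binary_score(p, abstain=True)
        print(f"p={p}: guess={guess}, abstain={idk}")
```

Under this scheme a guess never scores below an abstention, even when the model is only 10 percent likely to be right, so a model optimized for the benchmark has no reason to admit uncertainty.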

Charlie Dai, VP and principal analyst at Forrester, said enterprises already faced challenges with this dynamic in production deployments. “Clients increasingly struggle with model quality challenges in production, especially in regulated sectors like finance and healthcare,” Dai told Computerworld.

The research proposed “explicit confidence targets” as a solution, but acknowledged that fundamental mathematical constraints meant complete elimination of hallucinations remained impossible.
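
One way to picture a confidence target is a grading rule that penalizes wrong answers enough that answering pays off only above a stated confidence threshold t. The sketch below assumes a penalty of t/(1−t) points for a wrong answer, 1 point for a correct answer, and 0 for abstaining. That penalty form is an assumption made here for illustration and is not quoted in this article; the function names are hypothetical.

```python
def expected_score(p_correct: float, threshold: float, abstain: bool) -> float:
    """Expected score when wrong answers cost threshold / (1 - threshold) points."""
    if not 0.0 <= threshold < 1.0:
        raise ValueError("threshold must be in [0, 1)")
    if abstain:
        return 0.0
    penalty = threshold / (1.0 - threshold)  # assumed penalty for a wrong answer
    return p_correct - (1.0 - p_correct) * penalty


def should_answer(p_correct: float, threshold: float) -> bool:
    """Answer only if doing so beats the abstention score of 0."""
    return expected_score(p_correct, threshold, abstain=False) > 0.0


if __name__ == "__main__":
    print(should_answer(0.9, 0.75))  # confidence above the target
    print(should_answer(0.5, 0.75))  # confidence below the target
```

With this rule, answering has positive expected score exactly when the model's confidence exceeds t, and setting t to 0 recovers plain binary grading.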

Enterprises must adapt strategies

Experts said the mathematical inevitability of AI errors demanded new enterprise strategies.

“Governance must shift from prevention to risk containment,” Dai said. “This means stronger human-in-the-loop processes, domain-specific guardrails, and continuous monitoring.”

Current AI risk frameworks had proved inadequate for the reality of persistent hallucinations. “Current frameworks often underweight epistemic uncertainty, so updates are needed to address systemic unpredictability,” Dai added.

Shah advocated for industry-wide evaluation reforms similar to automotive safety standards. “Just as automotive components are graded under ASIL standards to ensure safety, AI models should be assigned dynamic grades, nationally and internationally, based on their reliability and risk profile,” he said.

Both analysts agreed that vendor selection criteria needed fundamental revision. “Enterprises should prioritize calibrated confidence and transparency over raw benchmark scores,” Dai said. “AI leaders should look for vendors that provide uncertainty estimates, robust evaluation beyond standard benchmarks, and real-world validation.”

Shah suggested developing “a real-time trust index, a dynamic scoring system that evaluates model outputs based on prompt ambiguity, contextual understanding, and source quality.”

Market already adapting

These enterprise concerns aligned with broader academic findings. A Harvard Kennedy School study found that “downstream gatekeeping struggles to filter subtle hallucinations due to budget, volume, ambiguity, and context sensitivity concerns.”

Dai noted that reforming evaluation standards faced significant obstacles. “Reforming mainstream benchmarks is challenging. It’s only feasible if it’s driven by regulatory pressure, enterprise demand, and competitive differentiation.”

The OpenAI researchers concluded that their findings required industry-wide changes to evaluation methods. “This change may steer the field toward more trustworthy AI systems,” they wrote, while acknowledging that their research proved some level of unreliability would persist regardless of technical improvements.

For enterprises, the message appeared clear: AI hallucinations represented not a temporary engineering challenge, but a permanent mathematical reality requiring new governance frameworks and risk management strategies.
