OpenAI’s o3, o4-mini Hallucinate More Than Older Models, Benchmark Test Shows

According to IBM, hallucination is when an AI model sees patterns that don’t exist, generating inaccurate or nonsensical outputs.

New York – As artificial intelligence becomes increasingly integrated into daily life, powering everything from search engines to customer support and creative tools, a critical flaw continues to cast a shadow over its advancement: hallucination. Despite improvements in language fluency and reasoning, leading AI models are still prone to generating false or nonsensical information—a problem experts say could undermine trust and usability.

According to IBM, hallucination occurs when a large language model (LLM), often in the form of a generative AI chatbot or computer vision tool, “perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate”.

The issue has resurfaced in a technical report by OpenAI, which evaluated its latest models—o3 and o4-mini—against earlier generations like o1, o1-mini, and o3-mini, along with GPT-4o, a traditional non-reasoning model. The findings are surprising: the newer models are more likely to hallucinate than their predecessors.

To benchmark hallucination, OpenAI used PersonQA, a specialized dataset of factual questions meant to assess a model’s ability to generate accurate answers.

“PersonQA is a dataset of questions and publicly available facts that measures the model’s accuracy on attempted answers”, the report notes.

The results raised eyebrows across the AI research community. The o3 model produced hallucinated answers in 33% of test cases—roughly double the rate of o1 (16%) and more than double that of o3-mini (14.8%). The o4-mini model performed even worse, hallucinating in 48% of responses.
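To make those figures concrete, the sketch below shows one plausible way a hallucination rate could be computed over a PersonQA-style question set. The data format, the toy grading helper, and the sample questions are illustrative assumptions, not OpenAI's actual evaluation code.

```python
# Illustrative sketch only: the field names, grader, and sample data below
# are assumptions, not OpenAI's PersonQA evaluation harness.

def grade_answer(model_answer: str, reference: str) -> bool:
    """Toy grader: treats an attempted answer as correct if it contains
    the reference fact (real graders are far more sophisticated)."""
    return reference.lower() in model_answer.lower()

def hallucination_rate(results: list[dict]) -> float:
    """Share of attempted answers that fail the grading check;
    questions the model declined to answer are excluded."""
    attempted = [r for r in results if r["answer"] is not None]
    if not attempted:
        return 0.0
    wrong = sum(1 for r in attempted
                if not grade_answer(r["answer"], r["reference"]))
    return wrong / len(attempted)

# Hypothetical model outputs for three PersonQA-style questions.
sample_results = [
    {"question": "Where was Ada Lovelace born?", "answer": "London",
     "reference": "London"},
    {"question": "Where was Alan Turing born?", "answer": "Manchester",
     "reference": "London"},
    {"question": "Where was Grace Hopper born?", "answer": None,
     "reference": "New York City"},
]

print(f"Hallucination rate: {hallucination_rate(sample_results):.1%}")  # 50.0%
```

Note that a metric defined this way counts only attempted answers, so a model that declines fewer questions can score worse even if its underlying knowledge is unchanged—one reason headline percentages deserve careful reading.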

What’s fueling this backward slide in factual accuracy is still unclear. The report did not provide a definitive reason for the spike in hallucinations, only stating that “more research” is needed to understand the issue.

This uncertainty raises broader questions about the trade-offs between scale, reasoning ability, and factual reliability. If more advanced reasoning models prove more prone to hallucination, researchers may face tougher challenges ahead in improving the trustworthiness of AI.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability”, said OpenAI spokesperson Niko Felix in a statement to TechCrunch.

As AI continues to evolve rapidly, striking the right balance between intelligence and reliability may determine the technology’s future adoption in critical domains such as healthcare, law, and education—where accuracy isn’t just important, it’s essential.
