OpenAI’s o3, o4-mini Hallucinate More Than Older Models, Benchmark Test Shows

According to IBM, hallucination is when an AI model sees patterns that don’t exist, generating inaccurate or nonsensical outputs.

New York – As artificial intelligence becomes increasingly integrated into daily life, powering everything from search engines to customer support and creative tools, a critical flaw continues to cast a shadow over its advancement: hallucination. Despite improvements in language fluency and reasoning, leading AI models are still prone to generating false or nonsensical information—a problem experts say could undermine trust and usability.

According to IBM, hallucination occurs when a large language model (LLM), often in the form of a generative AI chatbot or computer vision tool, “perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate”.

The issue has resurfaced in a technical report by OpenAI, which evaluated its latest models—o3 and o4-mini—against earlier generations like o1, o1-mini, and o3-mini, along with GPT-4o, a conventional model that does not use the company's step-by-step reasoning approach. The findings are surprising: the newer models are more likely to hallucinate than their predecessors.

To benchmark hallucination, OpenAI used PersonQA, an in-house dataset of factual questions about people designed to measure the accuracy of a model's answers.

“PersonQA is a dataset of questions and publicly available facts that measures the model’s accuracy on attempted answers”, the report notes.

The results raised eyebrows across the AI research community. The o3 model produced hallucinated answers in 33% of test cases—roughly double the rate of o1 (16%) and more than double that of o3-mini (14.8%). The o4-mini model performed even worse, hallucinating in 48% of responses.
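For readers curious how such figures are derived, the sketch below shows, in Python, one plausible way a hallucination rate could be computed from graded answers on a PersonQA-style benchmark. It is a minimal illustration only; the data structure and field names are assumptions, not OpenAI's actual evaluation code.

```python
# Illustrative sketch (not OpenAI's evaluation code): computing a hallucination
# rate as the share of attempted answers that were factually wrong.
# The "attempted" and "correct" fields are hypothetical labels assumed here.

def hallucination_rate(graded_answers):
    """Return the fraction of attempted answers that were incorrect."""
    attempted = [a for a in graded_answers if a["attempted"]]
    if not attempted:
        return 0.0
    wrong = sum(1 for a in attempted if not a["correct"])
    return wrong / len(attempted)

# Example: three attempted answers, one of them wrong -> a rate of about 33%,
# comparable in form (not in substance) to the figure reported for o3.
sample = [
    {"attempted": True, "correct": True},
    {"attempted": True, "correct": False},
    {"attempted": True, "correct": True},
    {"attempted": False, "correct": False},  # declined to answer; excluded
]
print(f"{hallucination_rate(sample):.0%}")  # prints "33%"
```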

What’s fueling this backward slide in factual accuracy is still unclear. The report did not provide a definitive reason for the spike in hallucinations, only stating that “more research” is needed to understand the issue.

This uncertainty raises broader questions about the trade-offs between scale, reasoning ability, and factual reliability. If newer, more capable reasoning models are more prone to hallucination, researchers may face tougher challenges ahead in improving the trustworthiness of AI.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability”, said OpenAI spokesperson Niko Felix in a statement to TechCrunch.

As AI continues to evolve rapidly, striking the right balance between intelligence and reliability may determine the technology’s future adoption in critical domains such as healthcare, law, and education—where accuracy isn’t just important, it’s essential.
