OpenAI’s o3, o4-mini Hallucinate More Than Older Models, Benchmark Test Shows

According to IBM, hallucination is when an AI model sees patterns that don’t exist, generating inaccurate or nonsensical outputs.

New York – As artificial intelligence becomes increasingly integrated into daily life, powering everything from search engines to customer support and creative tools, a critical flaw continues to cast a shadow over its advancement: hallucination. Despite improvements in language fluency and reasoning, leading AI models are still prone to generating false or nonsensical information—a problem experts say could undermine trust and usability.

According to IBM, hallucination occurs when a large language model (LLM), often in the form of a generative AI chatbot or computer vision tool, “perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate”.

The issue has resurfaced in a technical report by OpenAI, which evaluated its latest models, o3 and o4-mini, against earlier generations such as o1, o1-mini, and o3-mini, as well as GPT-4o, a conventional model without the dedicated reasoning capabilities of the o-series. The findings are surprising: the newer models are more likely to hallucinate than their predecessors.

To benchmark hallucination, OpenAI used PersonQA, its in-house dataset of questions about people, designed to measure the factual accuracy of a model's attempted answers.

“PersonQA is a dataset of questions and publicly available facts that measures the model’s accuracy on attempted answers”, the report notes.
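For readers unfamiliar with how such a metric works, the sketch below shows one way a hallucination rate over attempted answers could be computed on a PersonQA-style dataset. It is not OpenAI's evaluation code; the dataset format, the grading logic, and the model call are hypothetical placeholders used only to illustrate the idea of scoring attempted answers.

```python
# Minimal sketch (not OpenAI's actual harness): hallucination rate over
# attempted answers on a PersonQA-style dataset. QAItem, grade_answer(),
# and the model_answer callable are hypothetical placeholders.

from dataclasses import dataclass

@dataclass
class QAItem:
    question: str
    reference_fact: str  # publicly available fact the answer is checked against

def grade_answer(answer: str, reference_fact: str) -> str:
    """Return 'correct', 'hallucinated', or 'abstained' (toy string-match grader)."""
    if not answer.strip():
        return "abstained"
    return "correct" if reference_fact.lower() in answer.lower() else "hallucinated"

def hallucination_rate(items: list[QAItem], model_answer) -> float:
    """Fraction of attempted (non-abstained) answers that are hallucinated."""
    attempted = hallucinated = 0
    for item in items:
        verdict = grade_answer(model_answer(item.question), item.reference_fact)
        if verdict == "abstained":
            continue  # abstentions are excluded from the attempted-answer rate
        attempted += 1
        if verdict == "hallucinated":
            hallucinated += 1
    return hallucinated / attempted if attempted else 0.0
```

Under this kind of scoring, a model that answers more questions, rather than abstaining, can post a higher hallucination rate even if its raw knowledge has not degraded, which is one reason such figures need careful interpretation.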

The results raised eyebrows across the AI research community. The o3 model produced hallucinated answers in 33% of test cases, roughly double the rate of o1 (16%) and more than double that of o3-mini (14.8%). The o4-mini model performed even worse, hallucinating in 48% of responses.

What’s fueling this backward slide in factual accuracy is still unclear. The report did not provide a definitive reason for the spike in hallucinations, only stating that “more research” is needed to understand the issue.

This uncertainty raises broader questions about the trade-offs between scale, reasoning ability, and factual reliability. If larger, more complex models are more prone to hallucination, researchers may face tougher challenges ahead in improving the trustworthiness of AI.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability”, said OpenAI spokesperson Niko Felix in a statement to TechCrunch.

As AI continues to evolve rapidly, striking the right balance between intelligence and reliability may determine the technology’s future adoption in critical domains such as healthcare, law, and education—where accuracy isn’t just important, it’s essential.
