OpenAI’s o3, o4-mini Hallucinate More Than Older Models, Benchmark Test Shows

According to IBM, hallucination is when an AI model sees patterns that don’t exist, generating inaccurate or nonsensical outputs.

New York – As artificial intelligence becomes increasingly integrated into daily life, powering everything from search engines to customer support and creative tools, a critical flaw continues to cast a shadow over its advancement: hallucination. Despite improvements in language fluency and reasoning, leading AI models are still prone to generating false or nonsensical information—a problem experts say could undermine trust and usability.

According to IBM, hallucination occurs when a large language model (LLM), often in the form of a generative AI chatbot or computer vision tool, “perceives patterns or objects that are nonexistent or imperceptible to human observers, creating outputs that are nonsensical or altogether inaccurate”.

The issue has resurfaced in a technical report by OpenAI, which evaluated its latest models—o3 and o4-mini—against earlier generations like o1, o1-mini, and o3-mini, along with GPT-4o, a conventional model that does not use the company's step-by-step reasoning approach. The findings are surprising: the newer models are more likely to hallucinate than their predecessors.

To benchmark hallucination, OpenAI used PersonQA, an in-house dataset of factual questions about people designed to measure the accuracy of a model's answers.

“PersonQA is a dataset of questions and publicly available facts that measures the model’s accuracy on attempted answers”, the report notes.

The results raised eyebrows across the AI research community. The o3 model produced hallucinated answers in 33% of test cases—roughly double the rate of o1 (16%) and more than double that of o3-mini (14.8%). The o4-mini model performed even worse, hallucinating in 48% of responses.
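For readers curious how such figures are derived, the sketch below shows, in Python, one plausible way a hallucination rate could be computed from graded answers on a PersonQA-style benchmark. It is a minimal illustration only; the data structure and field names are assumptions, not OpenAI's actual evaluation code.

```python
# Illustrative sketch (not OpenAI's evaluation code): computing a hallucination
# rate as the share of attempted answers that were factually wrong.
# The "attempted" and "correct" fields are hypothetical labels assumed here.

def hallucination_rate(graded_answers):
    """Return the fraction of attempted answers that were incorrect."""
    attempted = [a for a in graded_answers if a["attempted"]]
    if not attempted:
        return 0.0
    wrong = sum(1 for a in attempted if not a["correct"])
    return wrong / len(attempted)

# Example: three attempted answers, one of them wrong -> a rate of about 33%,
# comparable in form (not in substance) to the figure reported for o3.
sample = [
    {"attempted": True, "correct": True},
    {"attempted": True, "correct": False},
    {"attempted": True, "correct": True},
    {"attempted": False, "correct": False},  # declined to answer; excluded
]
print(f"{hallucination_rate(sample):.0%}")  # prints "33%"
```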

What’s fueling this backward slide in factual accuracy is still unclear. The report did not provide a definitive reason for the spike in hallucinations, only stating that “more research” is needed to understand the issue.

This uncertainty raises broader questions about the trade-offs between scale, reasoning ability, and factual reliability. If newer, more capable reasoning models are more prone to hallucination, researchers may face tougher challenges ahead in improving the trustworthiness of AI.

“Addressing hallucinations across all our models is an ongoing area of research, and we’re continually working to improve their accuracy and reliability”, said OpenAI spokesperson Niko Felix in a statement to TechCrunch.

As AI continues to evolve rapidly, striking the right balance between intelligence and reliability may determine the technology’s future adoption in critical domains such as healthcare, law, and education—where accuracy isn’t just important, it’s essential.
