OpenAI's o3 and o4-mini Models Exhibit Unusual Levels of Hallucinations

OpenAI reports that its o3 model hallucinates in response to 33% of questions on PersonQA, a benchmark that measures the accuracy of a model's knowledge about people. That is roughly double the rate of the o1 (16%) and o3-mini (14.8%) models. o4-mini performs even worse, hallucinating 48% of the time.

Despite stronger performance in programming and mathematics, these models tend to “formulate more assertions”, which produces both more correct answers and more incorrect or hallucinated ones. In some cases, o3 fabricates actions, such as claiming it ran code on a 2021 MacBook Pro outside ChatGPT, something it cannot do.
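
To see how making more claims can raise correct and hallucinated answers at the same time, here is a rough sketch with made-up numbers. The counts are hypothetical and are not taken from OpenAI's report; only the mechanism is illustrated.

```python
# Hypothetical illustration: a model that abstains less often can score
# more correct answers AND more hallucinations on the same question set.
# All counts are invented for the example, not benchmark data.

def rates(correct: int, wrong: int, abstained: int) -> tuple[float, float]:
    """Return (accuracy, hallucination_rate) measured over all questions."""
    total = correct + wrong + abstained
    return correct / total, wrong / total

# Cautious model: answers 50 of 100 questions and declines the rest.
cautious = rates(correct=40, wrong=10, abstained=50)

# Assertive model: answers 90 of the same 100 questions.
assertive = rates(correct=60, wrong=30, abstained=10)

print(f"cautious:  accuracy={cautious[0]:.0%}, hallucinations={cautious[1]:.0%}")
print(f"assertive: accuracy={assertive[0]:.0%}, hallucinations={assertive[1]:.0%}")
```

In this toy setup the assertive model is more accurate overall, yet it also gives three times as many wrong answers, because it rarely declines to answer.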

You Should Know:

1. Benchmark Comparisons

PersonQA: Measures factual accuracy about public figures.
SimpleQA: A simpler benchmark where GPT-4o achieves 90% accuracy using web search.

2. Possible Fixes

Web Search Integration: Reduces hallucinations by fetching real-time data.
Fine-tuning Data Quality: Overloading models with excessive data may degrade performance.
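
One way to picture web search integration is a gate that releases a draft answer only when retrieved text supports it. The sketch below is a simplified illustration, not how OpenAI implements search: `grounded_answer`, `fake_search`, and the substring check are all hypothetical stand-ins, and a real system would need a proper search backend and a far more robust support check.

```python
from typing import Callable, Iterable

FALLBACK = "I could not verify this."

def grounded_answer(
    question: str,
    draft_answer: str,
    search: Callable[[str], Iterable[str]],
) -> str:
    """Return the draft only if some retrieved snippet mentions it."""
    needle = draft_answer.casefold()
    if any(needle in snippet.casefold() for snippet in search(question)):
        return draft_answer
    return FALLBACK

def fake_search(query: str) -> list[str]:
    # Stand-in for a real search backend; always returns canned text.
    return ["Ada Lovelace was born in 1815 in London."]

if __name__ == "__main__":
    question = "When was Ada Lovelace born?"
    print(grounded_answer(question, "1815", fake_search))  # supported draft
    print(grounded_answer(question, "1820", fake_search))  # unsupported draft
```

Abstaining on unsupported drafts trades some coverage for fewer confident errors.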

3. Example of Hallucination in Code Execution

# Fake claim by o3: "I ran this code on a MacBook Pro 2021"
def calculate_fibonacci(n):
    if n <= 1:
        return n
    else:
        return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)

print(calculate_fibonacci(10))  # Output: 55 (but o3 may claim incorrect steps)
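
Rather than trusting a model's statement that it ran something, you can run the snippet yourself and compare the real output with the claimed one. The following is a minimal sketch of that idea; the helper names `run_locally` and `claim_matches` are hypothetical, and model-generated code should only be run in a sandbox you trust.

```python
import subprocess
import sys

# The snippet whose output the model claims to have observed.
SNIPPET = """
def calculate_fibonacci(n):
    if n <= 1:
        return n
    return calculate_fibonacci(n - 1) + calculate_fibonacci(n - 2)

print(calculate_fibonacci(10))
"""

def run_locally(source: str, timeout: float = 10.0) -> str:
    """Run the source in a fresh Python interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise if the snippet exits with an error
    )
    return result.stdout.strip()

def claim_matches(source: str, claimed_output: str) -> bool:
    """Compare a model's claimed output with what the code really prints."""
    return run_locally(source) == claimed_output.strip()

if __name__ == "__main__":
    print(claim_matches(SNIPPET, "55"))
```

A match shows only that the claimed output is correct. It says nothing about whether the model actually executed anything.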

4. Linux Command to Verify System Execution (Debunking o3’s Claim)

ps aux | grep python  # Check running Python processes
lscpu  # Show this machine's CPU details, for comparison with the hardware o3 claimed

5. Windows PowerShell Alternative

Get-Process | Where-Object {$_.Name -like "python*"}
Get-CimInstance Win32_Processor | Select-Object Name

What Undercode Says

In these results, hallucination rates rose alongside reasoning capability, which suggests a trade-off between complexity and reliability.
Web search integration (like in GPT-4o) helps but raises privacy concerns due to third-party exposure.

Technical users should cross-verify AI outputs using:

curl -s https://api.openai.com/v1/claims | jq .  # Mock API check (illustrative endpoint, not a real OpenAI API)

Developers should add logging and fact-checking layers:

import logging 
logging.basicConfig(filename='ai_audit.log', level=logging.INFO)
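
Building on that logging setup, a fact-checking layer could pass each answer through a checker and record the outcome. This is a sketch under assumptions: `audited`, the `ai_audit` logger name, and the lambda checker are hypothetical, and what counts as a valid check depends entirely on your application.

```python
import logging
from typing import Callable

# Logger name chosen to mirror the ai_audit.log file; it is an assumption.
audit_log = logging.getLogger("ai_audit")

def audited(answer: str, checker: Callable[[str], bool]) -> tuple[str, bool]:
    """Run a fact-check on a model answer and log the outcome."""
    verified = checker(answer)
    # Unverified answers are logged at WARNING so they stand out in the audit trail.
    level = logging.INFO if verified else logging.WARNING
    audit_log.log(level, "answer=%r verified=%s", answer, verified)
    return answer, verified

if __name__ == "__main__":
    logging.basicConfig(filename="ai_audit.log", level=logging.INFO)
    # Hypothetical checker: accept only the known value of calculate_fibonacci(10).
    print(audited("55", lambda a: a == "55"))
```

Returning the verification flag alongside the answer lets the caller decide whether to show, flag, or withhold an unverified response.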

Expected Output:

A refined AI model with fewer hallucinations but slower response times.
Hybrid approaches (local + web-verified answers) may dominate future architectures.

Source: OpenAI Benchmark Details