OpenAI’s o3 and o4-mini Models Exhibit Unusual Levels of Hallucinations


OpenAI reports that its o3 model hallucinates on 33% of questions in PersonQA, a benchmark that measures how accurately a model answers questions about people. That is roughly double the rate of the o1 (16%) and o3-mini (14.8%) models. The o4-mini model performs even worse, hallucinating 48% of the time.

Despite improved performance in programming and mathematics, these models tend to make more claims overall, which yields both more correct answers and more incorrect or hallucinated ones. In some cases, o3 fabricates actions, such as claiming it executed code on a 2021 MacBook Pro outside ChatGPT, a capability it does not possess.

You Should Know:

1. Benchmark Comparisons

  • PersonQA: Measures factual accuracy about public figures.
  • SimpleQA: A simpler benchmark where GPT-4o achieves 90% accuracy using web search.
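
Benchmark figures like these are rates over graded answers. The following is a minimal sketch of how a model that attempts more questions can raise its accuracy and its hallucination rate at the same time. The grading labels, the `summarize` helper, and the sample data are all assumptions made for illustration; this is not OpenAI's PersonQA scoring code.

```python
from collections import Counter

# Hypothetical grading labels; the real PersonQA grader is not described here.
CORRECT, INCORRECT, NOT_ATTEMPTED = "correct", "incorrect", "not_attempted"


def summarize(grades):
    """Return (accuracy, hallucination_rate) over all questions asked."""
    if not grades:
        raise ValueError("grades must not be empty")
    counts = Counter(grades)
    total = len(grades)
    return counts[CORRECT] / total, counts[INCORRECT] / total


# Illustrative data only, not real benchmark results.
cautious = [CORRECT] * 5 + [INCORRECT] * 1 + [NOT_ATTEMPTED] * 4
assertive = [CORRECT] * 6 + [INCORRECT] * 3 + [NOT_ATTEMPTED] * 1

print(summarize(cautious))   # (0.5, 0.1)
print(summarize(assertive))  # (0.6, 0.3)
```

In this toy data the assertive model abstains less often, so it gains one correct answer but also two more wrong ones.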

2. Possible Fixes

  • Web Search Integration: Reduces hallucinations by fetching real-time data.
  • Fine-tuning Data Quality: Overloading models with excessive data may degrade performance.
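
One way to picture the web search fix is a grounding check: a draft answer is returned only when retrieved text supports it, and the system abstains otherwise. The sketch below assumes a hypothetical `search` callable and uses a stand-in `fake_search` with a one-entry corpus. The substring match is deliberately naive and only illustrates the control flow; it is not how any production system verifies claims.

```python
def grounded_answer(question, draft_answer, search):
    """Return the draft only if a retrieved snippet contains it; otherwise abstain."""
    snippets = search(question)
    needle = draft_answer.casefold()
    if any(needle in snippet.casefold() for snippet in snippets):
        return draft_answer
    return "I could not verify this."


# Stand-in for a real search backend (an assumption for this sketch).
def fake_search(question):
    corpus = {
        "capital of france": ["Paris is the capital of France."],
    }
    return corpus.get(question.casefold().rstrip("?"), [])


print(grounded_answer("Capital of France?", "Paris", fake_search))  # Paris
print(grounded_answer("Capital of France?", "Lyon", fake_search))   # I could not verify this.
```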

3. Example of Hallucination in Code Execution

    # Fake claim by o3: "I ran this code on a MacBook Pro 2021"
    def calculate_fibonacci(n):
        if n <= 1:
            return n
        return calculate_fibonacci(n - 1) + calculate_fibonacci(n - 2)

    print(calculate_fibonacci(10))  # Output: 55 (but o3 may claim incorrect steps)
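
Rather than trusting a model's statement that it ran something, the code can be executed locally and the real output compared with the claimed one. This is a minimal sketch: `matches_claim` and `SNIPPET` are hypothetical names introduced here, and the check only compares standard output.

```python
import subprocess
import sys

SNIPPET = """
def calculate_fibonacci(n):
    if n <= 1:
        return n
    return calculate_fibonacci(n - 1) + calculate_fibonacci(n - 2)

print(calculate_fibonacci(10))
"""


def matches_claim(code, claimed_output, timeout=10):
    """Run `code` in a fresh interpreter and compare stdout to the claimed output."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0 and result.stdout.strip() == claimed_output.strip()


print(matches_claim(SNIPPET, "55"))  # True
print(matches_claim(SNIPPET, "89"))  # False
```

Code produced by a model should be treated as untrusted, so a check like this belongs in a sandbox or a throwaway environment.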

4. Linux Commands to Verify System Execution (Debunking o3’s Claim)

    ps aux | grep python  # Check running Python processes
    lscpu                 # Show CPU details of the machine that actually ran the code, to compare against the claimed hardware


5. Windows PowerShell Alternative

    Get-Process | Where-Object {$_.Name -like "python*"}  # List running Python processes
    Get-CimInstance Win32_Processor | Select-Object Name  # Show the CPU model
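
The same facts can be collected from Python on Linux, macOS, or Windows using the standard library. This is a small sketch in which `local_environment` is a hypothetical helper. It reports where the interpreter is really running, which can then be compared with whatever hardware a model claims to have used.

```python
import os
import platform


def local_environment():
    """Collect basic facts about the machine this interpreter is running on."""
    return {
        "system": platform.system(),        # e.g. "Linux", "Darwin", "Windows"
        "machine": platform.machine(),      # e.g. "x86_64", "arm64"
        "processor": platform.processor(),  # may be an empty string on some systems
        "python": platform.python_version(),
        "pid": os.getpid(),
    }


for key, value in local_environment().items():
    print(f"{key}: {value}")
```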

What Undercode Say

  • In OpenAI’s reported results, the newer reasoning models hallucinate more than their predecessors, which suggests a possible trade-off between capability and reliability.
  • Web search integration (as in GPT-4o) helps but raises privacy concerns, because prompts are exposed to a third-party search provider.
  • Technical users should cross-verify AI outputs, for example:
    curl -s https://api.openai.com/v1/claims | jq .  # Mock API check: the /v1/claims path is a placeholder for illustration, not a real OpenAI endpoint

  • For developers: Logging and fact-checking layers should be implemented:
    import logging 
    logging.basicConfig(filename='ai_audit.log', level=logging.INFO) 
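
Building on the logging setup above, a fact-checking layer could record every exchange together with the result of a verification step. The sketch below is one possible shape: `audited` and the `verifier` callable are hypothetical, and the lambda used in the demo is a trivial stand-in for a real check.

```python
import logging

audit_log = logging.getLogger("ai_audit")


def audited(prompt, response, verifier):
    """Log the exchange and return (response, verified)."""
    verified = bool(verifier(response))
    # Unverified responses are logged at WARNING so they stand out in the audit file.
    level = logging.INFO if verified else logging.WARNING
    audit_log.log(level, "prompt=%r response=%r verified=%s", prompt, response, verified)
    return response, verified


if __name__ == "__main__":
    logging.basicConfig(filename="ai_audit.log", level=logging.INFO)
    # Hypothetical check: accept only the known-correct value of fib(10).
    print(audited("fib(10)?", "55", lambda r: r == "55"))  # ('55', True)
    print(audited("fib(10)?", "54", lambda r: r == "55"))  # ('54', False)
```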
    

Expected Output:

  • A refined AI model with fewer hallucinations but slower response times.
  • Hybrid approaches (local + web-verified answers) may dominate future architectures.

Source: OpenAI Benchmark Details

References:

Reported By: Bernardi Manuel – Hackers Feeds
