OpenAI reports that its o3 model hallucinates in response to 33% of questions on PersonQA, a benchmark that measures how accurately a model answers questions about people. This is roughly double the rate of o1 (16%) and o3-mini (14.8%). o4-mini performs even worse, hallucinating 48% of the time.
Despite improved performance in programming and mathematics, these models tend to make more claims overall, which leads to both more correct and more incorrect (hallucinated) responses. In some cases, o3 fabricates actions, such as claiming it executed code on a 2021 MacBook Pro outside ChatGPT, a capability it does not possess.
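To put the reported PersonQA figures side by side, here is a small sketch that computes each model's hallucination rate relative to o1. The percentages are the ones quoted above; the dictionary and variable names are illustrative only.

```python
# Hallucination rates on PersonQA as quoted in this post (percent of questions).
rates = {"o1": 16.0, "o3-mini": 14.8, "o3": 33.0, "o4-mini": 48.0}

baseline = rates["o1"]

# Ratio of each model's rate to o1's rate, rounded to one decimal place.
ratios = {model: round(rate / baseline, 1) for model, rate in rates.items()}

for model, ratio in ratios.items():
    print(f"{model}: {rates[model]}% ({ratio}x the o1 rate)")
```

On these numbers, o3 hallucinates roughly twice as often as o1, and o4-mini roughly three times as often.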
You Should Know:
1. Benchmark Comparisons
- PersonQA: Measures factual accuracy about public figures.
- SimpleQA: A simpler benchmark where GPT-4o achieves 90% accuracy using web search.
2. Possible Fixes
- Web Search Integration: Reduces hallucinations by fetching real-time data.
- Fine-tuning Data Quality: Overloading models with excessive data may degrade performance.
3. Example of Hallucination in Code Execution
# Fake claim by o3: "I ran this code on a MacBook Pro 2021"
def calculate_fibonacci(n):
    if n <= 1:
        return n
    else:
        return calculate_fibonacci(n-1) + calculate_fibonacci(n-2)

print(calculate_fibonacci(10))
# Output: 55 (but o3 may claim incorrect steps)
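One way to treat a reported result as a claim and not a fact is to re-run the code locally and compare. The following is a minimal sketch: `run_snippet`, `SNIPPET`, and `claimed_output` are hypothetical names introduced for illustration, and it assumes the snippet is trusted, since `exec` runs it in the current process.

```python
import contextlib
import io

# The code the model says it executed, kept as a string so it can be re-run.
SNIPPET = '''
def calculate_fibonacci(n):
    if n <= 1:
        return n
    return calculate_fibonacci(n - 1) + calculate_fibonacci(n - 2)

print(calculate_fibonacci(10))
'''

def run_snippet(source: str) -> str:
    """Execute trusted Python source and return what it printed, stripped."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        # A fresh namespace keeps the snippet's names out of this module.
        exec(source, {"__name__": "__snippet__"})
    return buffer.getvalue().strip()

claimed_output = "55"  # what the model said the code printed
actual_output = run_snippet(SNIPPET)

if actual_output == claimed_output:
    print("claimed output reproduced locally")
else:
    print(f"mismatch: claimed {claimed_output!r}, got {actual_output!r}")
```

Reproducing the output confirms only that the result is correct, not that the model ran anything; the execution claim itself still has to be checked against the environment.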
4. Linux Commands to Verify System Execution (Debunking o3’s Claim)
ps aux | grep python   # Check which Python processes are running on your machine
lscpu                  # Show local CPU details to compare against the hardware the model claimed
5. Windows PowerShell Alternative
Get-Process | Where-Object {$_.Name -like "python*"}
Get-CimInstance Win32_Processor | Select-Object Name
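The same kind of check can be scripted in a platform-independent way. The sketch below compares a claimed operating system against what the Python standard library reports for the machine running the script. `local_environment` and `matches_claimed_os` are hypothetical helper names, and a match shows only consistency, not that a model executed anything.

```python
import platform

def local_environment() -> dict:
    """Collect basic facts about the machine running this script."""
    return {
        "system": platform.system(),        # e.g. "Linux", "Darwin", "Windows"
        "machine": platform.machine(),      # e.g. "x86_64", "arm64"
        "python": platform.python_version(),
    }

def matches_claimed_os(claimed_system: str, env: dict) -> bool:
    """Return True if the claimed OS name equals the local one, ignoring case."""
    return claimed_system.strip().lower() == env["system"].lower()

env = local_environment()
print(env)

# macOS reports "Darwin" as its system name, so a Mac claim should match that.
print("consistent with a Mac:", matches_claimed_os("Darwin", env))
```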
What Undercode Say
- In these results, hallucination rates rose in the newer reasoning models (o3, o4-mini), suggesting a trade-off between reasoning capability and reliability.
- Web search integration (as in GPT-4o) helps but raises privacy concerns because queries are exposed to third parties.
- Technical users should cross-verify AI outputs using:
curl -s https://api.openai.com/v1/claims | jq .   # Mock API check (placeholder URL for illustration)
- For developers: Logging and fact-checking layers should be implemented:
import logging
logging.basicConfig(filename='ai_audit.log', level=logging.INFO)
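A minimal sketch of how such an audit layer might look, assuming the application already has some way to call a model and some way to check a claim. Both `ask_model` and `fact_check` are hypothetical callables supplied by the caller, and `fake_model` and `fake_check` are stand-ins for demonstration; none of them belong to a real library. The sketch logs to the console; passing `filename='ai_audit.log'` to `basicConfig` would write to a file instead.

```python
import logging
from typing import Callable, Tuple

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ai_audit")

def audited_answer(
    question: str,
    ask_model: Callable[[str], str],
    fact_check: Callable[[str, str], bool],
) -> Tuple[str, bool]:
    """Ask the model, run the fact check, and log the outcome."""
    answer = ask_model(question)
    verified = fact_check(question, answer)
    # Unverified answers are logged at WARNING so they stand out in the audit trail.
    level = logging.INFO if verified else logging.WARNING
    logger.log(level, "question=%r answer=%r verified=%s", question, answer, verified)
    return answer, verified

# Stand-in callables for demonstration only (hypothetical).
def fake_model(question: str) -> str:
    return "55"

def fake_check(question: str, answer: str) -> bool:
    return answer == "55"

answer, verified = audited_answer("What is the 10th Fibonacci number?", fake_model, fake_check)
print(answer, verified)
```

Keeping the checker as a separate callable means the verification method (web search, a local knowledge base, re-execution) can change without touching the logging path.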
Expected Output:
- A refined AI model with fewer hallucinations but slower response times.
- Hybrid approaches (local + web-verified answers) may dominate future architectures.
Source: OpenAI Benchmark Details
References:
Reported By: Bernardi Manuel – Hackers Feeds