The ASI Paradox
Austin Hatfield • December 2025 (Corrected with Proper Testing)
We Were Wrong About DeepSeek
Our initial testing showed DeepSeek with an 89% API error rate. We concluded it was "operationally unusable."
We were wrong. The problem wasn't DeepSeek—it was our testing methodology.
When tested properly with request staggering, DeepSeek is the FASTEST model tested, completing requests in 8.6 seconds on average, 16x faster than Grok's reasoning mode.
What We Actually Discovered
We tested three frontier AI models on 100 unsolved mathematical conjectures. All three failed to solve any problems (expected—these are open research problems).
But HOW they failed reveals something profound about AI architecture:
Adding "reasoning" capabilities makes models dramatically slower without improving their success rate at all.
More surprisingly: poor testing methodology can create entirely false narratives about model capabilities. DeepSeek isn't broken—our tests were.
The Experiment
We built a benchmark of 434 unsolved mathematical conjectures: real open problems drawn from the Erdős collection and TPTP challenges, problems mathematicians haven't solved.
The test: generate a formal proof in Lean 4, let the compiler verify it. Zero-trust evaluation. No humans, no partial credit. Either valid or not.
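A minimal sketch of that zero-trust check, assuming the harness shells out to the Lean 4 toolchain from Python; the `lake env lean` invocation, file handling, and function names are illustrative, not the benchmark's actual code:

```python
import subprocess
import tempfile
from pathlib import Path

VERIFY_TIMEOUT_S = 30  # matches the 30-second verification timeout in Methodology

def verify_lean_proof(proof_source: str, project_dir: str = ".") -> bool:
    """Return True only if Lean 4 compiles the candidate proof without errors."""
    if "sorry" in proof_source:
        return False  # a proof with holes is not a proof
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".lean", dir=project_dir, delete=False
    ) as f:
        f.write(proof_source)
        lean_file = Path(f.name)
    try:
        result = subprocess.run(
            ["lake", "env", "lean", str(lean_file)],  # assumes a Mathlib-enabled Lake project
            cwd=project_dir,
            capture_output=True,
            timeout=VERIFY_TIMEOUT_S,
        )
        return result.returncode == 0  # either the compiler accepts it or it doesn't
    except subprocess.TimeoutExpired:
        return False  # a proof that cannot be checked in time counts as a failure
    finally:
        lean_file.unlink(missing_ok=True)
```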
We tested three models on the same 100 randomly selected problems:
DeepSeek V3.2
Winner: Speed
8.6s average generation. No reasoning mode. Just fast, direct code generation.
Grok Code Fast
Winner: Reliability
100% reliability. 84.5s average. Specialized coding model without reasoning.
Grok 4.1 Fast
Loser: Everything
144s average. "High effort" reasoning enabled. Slowest. Same 0% success.
The DeepSeek Correction: How Bad Testing Creates False Narratives
Initial Test (Without Request Staggering):
- 89% API error rate - requests failed with timeouts and transfer encoding errors
- Only 11/100 problems tested
- Conclusion: "DeepSeek is operationally unusable"
Proper Test (With 2-Second Request Staggering + Retry Logic; sketched below):
- 14% API error rate - mostly empty responses, some payload errors
- 86/100 problems successfully tested and verified
- 8.6s average generation time - 16x FASTER than Grok reasoning mode
- Conclusion: DeepSeek is the fastest model tested, with reasonable reliability
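A minimal sketch of the corrected client, assuming OpenRouter's OpenAI-compatible chat completions endpoint; the model ID, retry count, and backoff values are illustrative assumptions, not the exact harness:

```python
import os
import time
import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
STAGGER_S = 2.0     # 2-second gap between requests
MAX_RETRIES = 3

def staggered_completion(prompt: str, model: str = "deepseek/deepseek-chat") -> str | None:
    """One model call with retries; returns None if every attempt errors out."""
    headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    for attempt in range(MAX_RETRIES):
        try:
            resp = requests.post(OPENROUTER_URL, json=payload, headers=headers, timeout=120)
            resp.raise_for_status()
            content = resp.json()["choices"][0]["message"]["content"]
            if content:          # empty responses are counted as API errors, not proof failures
                return content
        except (requests.RequestException, KeyError, ValueError):
            pass                 # timeouts, transfer-encoding and payload errors
        time.sleep(STAGGER_S * (attempt + 1))  # back off before retrying
    return None

# The driver also sleeps STAGGER_S between problems so requests never hit the
# provider simultaneously, which is what produced the original 89% error rate.
```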
Generation Speed Comparison
The difference is dramatic. DeepSeek generates responses 16x faster than Grok's reasoning mode.
The Universal Result
All three models: 0% success rate on 100 unsolved mathematical conjectures.
That's not surprising—these problems have stumped mathematicians for decades. But it reveals the paradox: why add reasoning capabilities that make responses 16x slower if they don't improve success?
The Reasoning Paradox
Grok 4.1's "reasoning mode" with "high effort" thinking adds:
- +70% latency vs Grok Code (144s vs 84.5s)
- +1570% latency vs DeepSeek (144s vs 8.6s)
- +0% improvement in success rate
The model recognizes these problems are hard. It hedges. It adds explanatory markdown. It qualifies its attempts.
All intelligent behaviors. All counterproductive for this task.
Why This Matters
Every AI lab is racing to add more reasoning, more thinking time, more meta-cognition.
But what if that's wrong for certain tasks?
Formal theorem proving requires blind exploration. Generate attempts. Let the verifier catch mistakes. Iterate.
A model that understands "this is probably impossible" will never try the 100 approaches needed to stumble on a solution.
DeepSeek doesn't overthink. It just generates code fast. That naive approach matches sophisticated reasoning's 0% success at 16x the speed.
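For illustration only: the benchmark gave each model a single attempt per problem, but the argument above points toward a generate-verify-iterate loop like the sketch below, with stub functions standing in for the model call and the Lean check:

```python
def generate_attempt(conjecture: str, attempt_no: int) -> str:
    """Stub for a fast, non-reasoning model emitting one Lean 4 candidate."""
    return f"-- candidate {attempt_no} for: {conjecture}\n"

def verify(candidate: str) -> bool:
    """Stub for the Lean 4 compiler acting as the zero-trust verifier."""
    return False  # on open conjectures, expect nearly every attempt to fail

def explore(conjecture: str, budget: int = 100) -> str | None:
    # Blind exploration: never pre-judge whether the problem is "too hard";
    # keep generating cheap attempts and let the verifier do the rejecting.
    for attempt_no in range(budget):
        candidate = generate_attempt(conjecture, attempt_no)
        if verify(candidate):
            return candidate
    return None  # budget exhausted without a verified proof
```

The speed gap is what makes this loop practical: 100 attempts at 8.6s each is about 14 minutes, while 100 attempts at 144s is four hours.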
Complete Comparison (n=100 per model)
| Metric | DeepSeek V3.2 | Grok Code | Grok 4.1 Reasoning |
|---|---|---|---|
| Sample Size | 100 | 100 | 100 |
| Verified | 0 (0%) | 0 (0%) | 0 (0%) |
| Failed | 86 | 100 | 100 |
| API Errors | 14 (14%) | 0 (0%) | 0 (0%) |
| Reliability | 86% | 100% | 100% |
| Avg Gen Time | 8.6s | 84.5s | 144.0s |
| Speed vs DeepSeek | 1x (baseline) | 10x slower | 17x slower |
Lesson #1
Testing Methodology Matters
Our initial DeepSeek test without request staggering created an 89% error rate. Proper testing revealed 14% errors and 16x faster speed. Bad tests create false narratives.
Lesson #2
Reasoning ≠ Better Results
Grok 4.1's reasoning mode is 70% slower than code mode and 1570% slower than DeepSeek, with an identical 0% success rate. Meta-cognition became a liability.
Lesson #3
Fast & Naive Beats Slow & Smart
For tasks requiring exploration, DeepSeek's fast, direct approach outperforms sophisticated reasoning. Sometimes dumber is smarter.
The Bottom Line
We spent $0.15 and learned four critical lessons:
1. Test properly or mislead completely
DeepSeek went from "89% failure" to "fastest model" with correct testing.
2. Reasoning modes can make things worse
Grok 4.1 reasoning: 17x slower than DeepSeek, same 0% success.
3. Speed matters when exploration is required
Fast iteration beats slow contemplation for certain problem types.
4. Infrastructure reliability varies widely
Grok: 100% reliable. DeepSeek: 86% reliable with occasional empty responses.
Methodology
All code and data are open source.
- Same 100 problems (random seed: 42)
- Same prompt emphasizing format requirements
- 30-second Lean verification timeout
- DeepSeek: 2-second request staggering to prevent rate limiting
- Lean 4 for formal verification (zero-trust proof assistant)
Key learning: the initial DeepSeek test failed because simultaneous requests overwhelmed OpenRouter. Proper staggering revealed the model's true performance.
Open Questions
- On reasoning: What other tasks benefit from "dumber" fast models over "smarter" slow ones?
- On testing: How many other AI evaluations have methodological bugs creating false narratives?
- On architecture: Should we route exploration tasks to fast models and planning tasks to reasoning models?
- On AGI: Is the path forward through faster iteration rather than deeper thinking?
∎
Acknowledgments: Built on the ASI Leaderboard project. Thanks to the Lean community for Mathlib. Special thanks to the DeepSeek team—your model isn't broken, our testing was. All mistakes are mine.
Data & Code: Full results, code, and methodology at github.com/yourusername/asi-leaderboard