The ASI Paradox

Austin Hatfield • December 2025 (Corrected with Proper Testing)

We Were Wrong About DeepSeek

Our initial testing showed DeepSeek with 89% API errors. We concluded it was "operationally unusable."

We were wrong. The problem wasn't DeepSeek—it was our testing methodology.

When tested properly with request staggering, DeepSeek is the FASTEST model tested, completing requests in 8.6 seconds on average, roughly 16x faster than Grok's reasoning mode.

What We Actually Discovered

We tested three frontier AI models on 100 unsolved mathematical conjectures. All three failed to solve any problems (expected—these are open research problems).

But HOW they failed reveals something profound about AI architecture:

Adding "reasoning" capabilities makes models dramatically slower without improving their success rate at all.

More surprisingly: poor testing methodology can create entirely false narratives about model capabilities. DeepSeek isn't broken—our tests were.

$0.15
Total Cost
To discover that testing methodology matters more than we thought

The Experiment

We built a benchmark from 434 unsolved mathematical conjectures: real open problems drawn from the Erdős collection and TPTP challenge sets, problems mathematicians haven't solved.

The test: generate a formal proof in Lean 4, let the compiler verify it. Zero-trust evaluation. No humans, no partial credit. Either valid or not.
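For concreteness, here is a minimal sketch of what that zero-trust check can look like. It is an illustration, not the project's actual harness: the temp-file layout, the lake env lean invocation, and the function name are assumptions; the only details taken from this post are the 30-second timeout and the all-or-nothing pass rule.

    import subprocess
    import tempfile
    from pathlib import Path

    def verify_lean_proof(proof_source: str, timeout_s: int = 30) -> bool:
        """Zero-trust check: an attempt counts only if the Lean 4 compiler accepts it."""
        # Write the candidate proof to a temporary .lean file.
        with tempfile.NamedTemporaryFile(mode="w", suffix=".lean", delete=False) as f:
            f.write(proof_source)
            path = Path(f.name)
        try:
            # Compile with the project toolchain; `lake env lean` assumes a Lean 4
            # project with Mathlib already set up (an assumption, not the post's exact command).
            result = subprocess.run(
                ["lake", "env", "lean", str(path)],
                capture_output=True,
                text=True,
                timeout=timeout_s,  # 30-second verification timeout from the methodology
            )
            return result.returncode == 0  # no partial credit: it compiles or it doesn't
        except subprocess.TimeoutExpired:
            return False
        finally:
            path.unlink(missing_ok=True)

Anything the compiler rejects, times out on, or crashes on is simply a failed attempt.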

We tested three models on the same 100 randomly selected problems:

DeepSeek V3.2

Winner: Speed

8.6s average generation. No reasoning mode. Just fast, direct code generation.

Grok Code Fast

Winner: Reliability

100% reliability. 84.5s average. Specialized coding model without reasoning.

Grok 4.1 Fast

Loser: Everything

144s average. "High effort" reasoning enabled. Slowest. Same 0% success.

The DeepSeek Correction: How Bad Testing Creates False Narratives

Initial Test (Without Request Staggering):

  • 89% API error rate - requests failed with timeouts and transfer encoding errors
  • Only 11/100 problems tested
  • Conclusion: "DeepSeek is operationally unusable"

Proper Test (With 2-Second Request Staggering + Retry Logic):

  • 14% API error rate - mostly empty responses, some payload errors
  • 86/100 problems successfully tested end to end (generation plus Lean verification)
  • 8.6s average generation time - 16x FASTER than Grok reasoning mode
  • Conclusion: DeepSeek is the fastest model tested, with reasonable reliability
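For reference, the fix is roughly this shape in a Python harness. This is a sketch under assumptions: the OpenRouter chat-completions endpoint and response format follow the OpenAI schema, but the model slug, retry count, backoff, and environment variable name are illustrative rather than the exact values used for these runs.

    import os
    import time
    import requests

    OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
    STAGGER_SECONDS = 2   # pause between requests instead of firing all 100 at once
    MAX_RETRIES = 3       # retry empty responses and transient transport errors

    def generate_proof(prompt: str, model: str = "deepseek/deepseek-chat") -> str | None:
        """Request one proof attempt through OpenRouter, with simple retry logic."""
        headers = {"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"}
        payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
        for attempt in range(MAX_RETRIES):
            try:
                resp = requests.post(OPENROUTER_URL, json=payload, headers=headers, timeout=120)
                resp.raise_for_status()
                text = resp.json()["choices"][0]["message"]["content"]
                if text:                      # empty responses count as failures: retry
                    return text
            except (requests.RequestException, KeyError):
                pass                          # timeouts / transfer / payload errors: retry
            time.sleep(STAGGER_SECONDS * (attempt + 1))  # back off before the next try
        return None

    # The stagger itself lives in the outer benchmark loop:
    # for problem in problems:
    #     attempt = generate_proof(build_prompt(problem))
    #     time.sleep(STAGGER_SECONDS)

The design point is simple: spacing requests out and retrying transient failures measures the model instead of measuring the rate limiter.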

Generation Speed Comparison

The difference is dramatic. DeepSeek generates responses 16x faster than Grok's reasoning mode.

The Universal Result

All three models: 0% success rate on 100 unsolved mathematical conjectures.

That's not surprising—these problems have stumped mathematicians for decades. But it reveals the paradox: why add reasoning capabilities that make responses 16x slower if they don't improve success?

8.6s
DeepSeek V3.2
FASTEST • 86% reliability • 0% success
84.5s
Grok Code Fast
Fast • 100% reliability • 0% success
144s
Grok 4.1 Reasoning
SLOWEST • 100% reliability • 0% success

The Reasoning Paradox

Grok 4.1's "reasoning mode" with "high effort" thinking adds:

  • +70% latency vs Grok Code (144s vs 84.5s)
  • +1570% latency vs DeepSeek (144s vs 8.6s)
  • +0% improvement in success rate

The model recognizes these problems are hard. It hedges. It adds explanatory markdown. It qualifies its attempts.

All intelligent behaviors. All counterproductive for this task.

Why This Matters

Every AI lab is racing to add more reasoning, more thinking time, more meta-cognition.

But what if that's wrong for certain tasks?

Formal theorem proving requires blind exploration. Generate attempts. Let the verifier catch mistakes. Iterate.
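A minimal sketch of that loop, reusing the hypothetical generate_proof and verify_lean_proof helpers from the earlier snippets (the attempt budget here is illustrative, not something this experiment ran):

    def explore(problem_statement: str, max_attempts: int = 100) -> str | None:
        """Blind exploration: keep generating attempts and let the verifier reject them."""
        # No meta-cognition about whether the problem "looks impossible";
        # the only judge is the Lean compiler.
        prompt = f"Prove the following statement in Lean 4:\n{problem_statement}"
        for _ in range(max_attempts):
            attempt = generate_proof(prompt)            # fast, direct generation
            if attempt and verify_lean_proof(attempt):  # zero-trust check
                return attempt                          # first verified proof wins
        return None                                     # budget exhausted, no proof found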

A model that understands "this is probably impossible" will never try the 100 approaches needed to stumble on a solution.

DeepSeek doesn't overthink. It just generates code fast. That naive approach runs 16x faster than sophisticated reasoning and lands on exactly the same success rate.

Complete Comparison: n = 100 per Model

Metric              DeepSeek V3.2    Grok Code    Grok 4.1 Reasoning
Sample Size         100              100          100
Verified            0 (0%)           0 (0%)       0 (0%)
Failed              86               100          100
API Errors          14 (14%)         0 (0%)       0 (0%)
Reliability         86%              100%         100%
Avg Gen Time        8.6s             84.5s        144.0s
Speed vs DeepSeek   1x (baseline)    10x slower   17x slower

Lesson #1

Testing Methodology Matters

Our initial DeepSeek test without request staggering created an 89% error rate. Proper testing revealed 14% errors and 16x faster speed. Bad tests create false narratives.

Lesson #2

Reasoning ≠ Better Results

Grok 4.1's reasoning mode is 70% slower than code mode and roughly 17x slower than DeepSeek, with an identical 0% success rate. Meta-cognition became a liability.

Lesson #3

Fast & Naive Beats Slow & Smart

For tasks requiring exploration, DeepSeek's fast, direct approach outperforms sophisticated reasoning. Sometimes dumber is smarter.

The Bottom Line

We spent $0.15 and learned four critical lessons:

1. Test properly or mislead completely

DeepSeek went from "89% failure" to "fastest model" with correct testing.

2. Reasoning modes can make things worse

Grok 4.1 reasoning: 17x slower than DeepSeek, same 0% success.

3. Speed matters when exploration is required

Fast iteration beats slow contemplation for certain problem types.

4. Infrastructure reliability varies widely

Grok: 100% reliable. DeepSeek: 86% reliable with occasional empty responses.

Methodology

All code and data are open source.

  • Same 100 problems (random seed: 42)
  • Same prompt emphasizing format requirements
  • 30-second Lean verification timeout
  • DeepSeek: 2-second request staggering to prevent rate limiting
  • Lean 4 for formal verification (zero-trust proof assistant)
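The seed is what makes the sample reproducible across models. A minimal sketch, assuming the 434-problem pool lives in a JSON file (the filename and structure are assumptions):

    import json
    import random

    # Hypothetical file: a list of 434 conjecture records, one per open problem.
    with open("conjectures.json") as f:
        all_conjectures = json.load(f)

    random.seed(42)                                  # the seed reported above
    benchmark = random.sample(all_conjectures, 100)  # same 100 problems for every model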

Key learning: the initial DeepSeek test failed because simultaneous requests overwhelmed OpenRouter; proper staggering revealed the model's true performance.

Open Questions

  • On reasoning: What other tasks benefit from "dumber" fast models over "smarter" slow ones?
  • On testing: How many other AI evaluations have methodological bugs creating false narratives?
  • On architecture: Should we route exploration tasks to fast models and planning tasks to reasoning models?
  • On AGI: Is the path forward through faster iteration rather than deeper thinking?

Acknowledgments: Built on the ASI Leaderboard project. Thanks to the Lean community for Mathlib. Special thanks to the DeepSeek team—your model isn't broken, our testing was. All mistakes are mine.

Data & Code: Full results, code, and methodology at github.com/yourusername/asi-leaderboard