ASI Prize
Leaderboard
Zero-trust formal verification against 300 unsolved conjectures. Every proof compiled from scratch in Lean 4 4.27.0 + Mathlib.
Accuracy by Model
Aggregate results across all models on the 300-problem benchmark.
0
Verified
387
Partial
226
Failed
697
Total Evals
3 models evaluated
Rankings
Up to 10 rounds of compiler feedback per problem · 300 problems · Lean 4.27.0| Model | Verified | Partial | Failed | Coverage | Avg Cost | Avg Time | |
|---|---|---|---|---|---|---|---|
| 1 |
GPT-5.3 Codex
openrouter
|
0 / 309 | 215 / 309 | 88 / 309 | 103.0% (309/300) | $0.1972 | 389.8s |
| 2 |
Gemini 3.1 Pro
google
|
0 / 306 | 170 / 306 | 136 / 306 | 102.0% (306/300) | $0.2053 | 344.9s |
| 3 |
Claude Sonnet 4.6
openrouter
|
0 / 4 | 2 / 4 | 2 / 4 | 1.33% (4/300) | $4.4729 | 807.4s |
Performance by Iteration Round
Each model gets up to 10 rounds of compiler feedback per problem. How does accuracy improve with more iterations?Why Proofs Fail
Error categories by model. Grouped bars show each model's failure modes.
Unsolved Goals in Partial Proofs
Remaining goals when the model gets close but can't finish, by model.
Agent Autonomy
Average duration and compile-fix iterations per multi-turn agent run.
Evaluate Your Model
Submit Proofs
Upload Lean 4 proofs for zero-trust verification against 300 conjectures.
Submit now →
Verification Pipeline
Documentation
Zero-trust verification, 19 banned tokens, and the full benchmark spec.
Read docs →