First AI-verified proof from the unsolved conjectures benchmark
Green's Problem 24

max013AffineTranslates n ≤ n²

An AI model formally proved an open conjecture from the benchmark. Exit code 0, clean axioms, zero human intervention.

Verified
Gemini 3 Flash
Model
3 of 10
Iterations
178s
Total Time
Exit 0
Compiler Result

The Proof

Exact tactic proof from iteration 3. Compiled with warningAsError true, zero warnings.

theorem upper_trivial {n : } : max013AffineTranslates n ≤ n ^ 2 := by unfold max013AffineTranslates apply csSup_le · obtain ⟨A, hA⟩ : ∃ A : Finset ℤ, A.card = n := ⟨(Finset.range n).image Int.ofNat, by rw [Finset.card_image_of_injective _ Int.ofNat_injective, Finset.card_range]⟩ exact ⟨_, ⟨A, hA, rfl⟩⟩ · rintro k ⟨A, hA, rfl⟩ apply le_trans (Finset.card_filter_le _ _) rw [Finset.card_product, hA, pow_two]

Axiom Trace

The three standard axioms used by all Lean 4 / Mathlib proofs. No sorryAx, no custom axioms.

propext Classical.choice Quot.sound

Lean 4.27.0 · Mathlib v4.27.0

Compiled February 17, 2026

How the Model Found the Proof

3 compilation attempts. The failure-and-correction trajectory demonstrates reasoning under compiler feedback, not memorization.

1
Failed — Hallucinated lemma name Used Nat.sSup_le which doesn't exist in Mathlib. The model assumed a natural-number-specific supremum lemma existed. 17,749 thinking tokens, 108s.
2
Failed — Wrong type class Corrected to sSup_le but this requires CompleteSemilatticeSup, while only has ConditionallyCompleteLattice. 6,237 thinking tokens, 39s.
3
Verified — Correct lemma + nonemptiness witness Used csSup_le (conditional supremum) and constructed the witness (Finset.range n).image Int.ofNat with injectivity proof. 1,762 thinking tokens, 13s. Exit code 0.

Why This Matters

A memorized proof would succeed on attempt 1.

The error-correction trajectory — wrong lemma, wrong type class, correct conditional variant — is evidence of genuine compositional reasoning under compiler feedback.

Honest assessment: The proved bound is mathematically elementary — |filter(A×A)| ≤ |A×A| = n². This is a milestone in autonomous proof composition, not a mathematical breakthrough.

Mathematical Context

Green's Open Problem 24

From Ben Green's "100 open problems" (2024): "If A is a set of n integers, what is the maximum number of affine translates of {0, 1, 3} that A can contain?"

The function max013AffineTranslates(n) counts ordered pairs (x, y) in A × A where x ≠ y and x + 3(y − x) ∈ A, maximized over all finite integer sets A of size n.

References

  • Green, Ben. "100 open problems." (2024). Problem 24
  • Aaronson, James. "Maximising the number of solutions..." Bull. London Math. Soc. 51.4 (2019): 577–594.
  • Hardy, G.H. & Littlewood, J.E. "Notes on the theory of series (VIII)." JLMS 1.2 (1928).

Five Theorems in the File

The formal-conjectures file contains five theorems, all originally stated with sorry.

Theorem Statement Status Difficulty
upper_trivial n ≤ n² Verified Trivial bound
γ ≥ 1/12 lower_HL Unsolved Explicit construction
γ ≤ 3/4 upper_HL Unsolved Hardy-Littlewood
γ = 1/3 conjecture Unsolved Open (Aaronson 2019)
green_24 exact for all n Unsolved Main open problem

Full Evaluation

300 unsolved conjectures evaluated with up to 10 iterations of compiler feedback each. Zero human intervention.

1
Verified
198
Partial
98
Failed
66.7%
Partial Rate

75% of partials have exactly 1 unsolved goal — one step from proving the conjecture.

These are genuinely unsolved problems. ProofBench reports 15% on solved textbook problems. Our 0.3% on unsolved conjectures is the correct baseline.

Reproduce This Result

You don't need to trust us. The Lean 4 kernel is the arbiter. When lake lean exits with code 0 and #print axioms shows only standard axioms, the proof is mechanically verified. No human judgment required.
Step 1: Clone and build formal-conjectures git clone https://github.com/google-deepmind/formal-conjectures.git cd formal-conjectures git checkout 3b8da090a44125f1c2817c3523d71d229efa9534 curl https://elan-init.trycloudflare.com -sSf | sh source ~/.elan/env lake build
Step 2: Save the proof file import FormalConjectures.GreensOpenProblems.«24» set_option linter.style.copyright false set_option linter.style.ams_attribute false set_option linter.style.category_attribute false set_option warningAsError true theorem Green24.variants.upper_trivial' {n : ℕ} : Green24.max013AffineTranslates n ≤ n ^ 2 := by unfold Green24.max013AffineTranslates apply csSup_le · obtain ⟨A, hA⟩ : ∃ A : Finset ℤ, A.card = n := ⟨(Finset.range n).image Int.ofNat, by rw [Finset.card_image_of_injective _ Int.ofNat_injective, Finset.card_range]⟩ exact ⟨_, ⟨A, hA, rfl⟩⟩ · rintro k ⟨A, hA, rfl⟩ apply le_trans (Finset.card_filter_le _ _) rw [Finset.card_product, hA, pow_two] #print axioms Green24.variants.upper_trivial'
Step 3: Compile and verify lake lean /path/to/verify_proof.lean

Expected output:

'Green24.variants.upper_trivial'' depends on axioms: [propext, Classical.choice, Quot.sound]

Expected exit code: 0

V2 Zero-Trust Pipeline

Migrated to Lean's official Comparator. No agent self-reports are trusted.

1
Banned Token Scan 19 tokens + NFKC Unicode normalization
2
Sandboxed Execution systemd-run isolates the build
3
Comparator Build Builds ground-truth challenge
4
Agent Build Builds agent's proposed solution
5
Kernel Replay Evaluates low-level Lean exports
6
Exact Signature Match Cryptographically ensures no redefinition
7
Axiom Audit Rejects sorryAx and unauthorized axioms

Source

  • Repo: formal-conjectures
  • File: GreensOpenProblems/24.lean
  • PR: #1772 by jeangud, reviewed by YaelDillies (DeepMind)
  • Tags: @[category research open, AMS 5 11]
  • Lean: v4.27.0
  • Commit: 3b8da09
AI model leaderboard for formal mathematics proof verification
Model Evaluation

Leaderboard

See how models perform across all 300 unsolved conjectures.

View leaderboard →
Browse 300 unsolved mathematical conjectures
300 Conjectures

Browse Problems

Erdos, Millennium, Green, OEIS, and more. No human has proved them.

Explore →