MATH
Competition-level math problems drawn from AMC, AIME, and olympiad-style contests. Tests symbolic reasoning and multi-step problem solving.
Metric: Percentage
Max score: 100
Maintainer: UC Berkeley (Hendrycks et al.)
Models scored: 16
Why it matters
If MMLU asks "what do you know," MATH asks "can you actually solve things." Frontier reasoning models now come close to saturating it, and top scores keep climbing.
Known limitations
Training-data contamination is a real concern, since the problems are publicly available. When saturation on MATH hides differences between models, newer evals (HMMT, AIME 2024+) are used instead.
Model rankings
- 01 · 96.2% · o4
- 02 · 94.1% · DeepSeek R2
- 03 · 92.0% · Gemini 2.5 Ultra
- 04 · 91.5% · GPT-5 Turbo
- 05 · 89.3% · Claude Opus 4.7
- 06 · 88.7% · DeepSeek V4
- 07 · 86.5% · Grok 4
- 08 · 85.1% · Qwen 3 Max
- 09 · 85.0% · Claude Sonnet 4.6
- 10 · 84.2% · Gemini 2.5 Flash
- 11 · 82.0% · Phi-5
- 12 · 81.5% · Llama 4 Behemoth
- 13 · 79.8% · GPT-5 Mini
- 14 · 78.2% · Mistral X-Large
- 15 · 76.8% · Claude Haiku 4.5
- 16 · 72.5% · Llama 4 Scout