Inference Index
Math

MATH

Competition-level math problems drawn from AMC, AIME, and similar olympiad-style contests. Tests symbolic reasoning and the ability to carry out multi-step solutions.

Metric: Percentage
Max score: 100
Maintainer: UC Berkeley (Hendrycks et al.)
Models scored: 16
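As a rough illustration of how a percentage score like this can be computed, the sketch below extracts the final `\boxed{...}` answer from each model solution and compares it to the reference answer by exact string match. The function names `extract_boxed` and `accuracy_percent` are hypothetical, and the exact-match rule is a simplifying assumption: real evaluation harnesses typically normalize equivalent LaTeX forms before comparing, which this sketch does not attempt.

```python
from typing import Optional, Sequence


def extract_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in a solution, or None."""
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None
    i = start + len(marker)
    depth = 1  # we are inside the opening brace of \boxed{
    chars = []
    while i < len(solution):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Matching close brace found: everything collected is the answer.
                return "".join(chars).strip()
        chars.append(ch)
        i += 1
    return None  # braces never balanced


def accuracy_percent(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predictions whose boxed answer equals the reference string.

    Assumption: exact string match after stripping whitespace, with no
    LaTeX normalization.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("need at least one problem to score")
    correct = sum(
        extract_boxed(pred) == ref.strip()
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)
```

A solution with no boxed answer is simply counted as wrong, so the score stays within the 0 to 100 range shown above.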

Why it matters

If MMLU asks "what do you know," MATH asks "can you actually solve things." Frontier reasoning models now score close to the 100% maximum, so the benchmark is approaching saturation.

Known limitations

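Training data contamination is a real risk: the problems are public, so they may appear in a model's training data and inflate its score. Newer evals (HMMT, AIME 2024 and later) are used when saturation on MATH hides differences between models.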
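One common way to probe for contamination is to measure word n-gram overlap between a benchmark problem and a document from a training corpus. The sketch below is a minimal version of that idea, not the method used by this benchmark or any particular lab. The names `word_ngrams` and `overlap_fraction` are hypothetical, and the default window of 8 words is an arbitrary assumption.

```python
import re
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Lowercase the text, keep alphanumeric tokens, and return its word n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(problem: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the problem's n-grams that also occur in the corpus document.

    Assumption: n=8 is an illustrative window size, not a recommended setting.
    Returns 0.0 when the problem has fewer than n tokens.
    """
    problem_grams = word_ngrams(problem, n)
    if not problem_grams:
        return 0.0
    shared = problem_grams & word_ngrams(corpus_doc, n)
    return len(shared) / len(problem_grams)
```

A value near 1.0 suggests the problem text appears almost verbatim in the document. A low value does not rule out contamination, since paraphrased or translated copies would not be caught by this check.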
