Inference Index

Math

MATH

Competition-level math problems drawn from AMC, AIME, and Olympiad-style contests. Tests symbolic reasoning and multi-step problem solving.

Metric

Percentage

Max score

100

Maintainer

UC Berkeley (Hendrycks et al.)

Models scored

16
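
As a rough illustration of the percentage metric listed above, the sketch below scores a model by the share of final answers that match the reference answers. The function names `normalize` and `math_score` are hypothetical, and the whitespace-only normalization is a simplifying assumption; real graders typically apply more careful answer-equivalence checks.

```python
def normalize(answer: str) -> str:
    """Strip all whitespace so 'x + 1' and 'x+1' compare equal.

    Assumption: whitespace is the only formatting difference handled here.
    """
    return "".join(answer.split())


def math_score(predictions: list[str], references: list[str]) -> float:
    """Return the percentage (0 to 100) of predictions matching the references."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one problem is required")
    correct = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return 100.0 * correct / len(references)


if __name__ == "__main__":
    # Two of the three illustrative answers match their references.
    print(math_score(["1/2", " 42", "x+1"], ["1/2", "42", "x + 2"]))
```

Under this scheme a score of 100 means every final answer matched, which is the maximum listed above.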

Why it matters

If MMLU is "what do you know," MATH is "can you actually solve things." Frontier reasoning models now score close to the 100% maximum, so the benchmark is approaching saturation.

Known limitations

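Training data contamination is a real concern: the problems are public, so they may appear in training corpora. Newer evals (HMMT, AIME 2024 and later) are used when saturation on MATH hides differences between models.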
