Inference Index
Math

GSM8K

About 8.5K grade-school math word problems. Frontier models have saturated it, but it is still used as a smoke test for arithmetic reasoning.

Metric: Percentage
Max score: 100
Maintainer: OpenAI
Models scored: 16
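
As a rough illustration of how a percentage score like this is usually produced, the sketch below computes exact-match accuracy on the final numeric answer. It assumes reference solutions end with a `#### <number>` marker, as in the public GSM8K release, and that a model's answer is the last number in its output. The helper names `extract_final_number` and `accuracy_percent` are hypothetical, and individual model cards may use different extraction rules or prompting setups.

```python
import re


def extract_final_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    # If a GSM8K-style '####' marker is present, only look at what follows it.
    if "####" in text:
        text = text.split("####")[-1]
    # Match integers and decimals, allowing thousands separators such as 1,250.
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy_percent(predictions, references):
    """Exact-match accuracy on final answers, on a 0-100 scale."""
    if not references or len(predictions) != len(references):
        raise ValueError("predictions and references must be non-empty and equal in length")
    correct = 0
    for pred, ref in zip(predictions, references):
        p = extract_final_number(pred)
        r = extract_final_number(ref)
        # A missing answer on either side counts as incorrect.
        if p is not None and r is not None and abs(p - r) < 1e-6:
            correct += 1
    return 100.0 * correct / len(references)
```

With this scoring rule, a model that answers every problem correctly reaches the maximum score of 100.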

Why it matters

For years, GSM8K was the default benchmark for multi-step arithmetic reasoning. Frontier models now saturate it (99%+), but it still appears in nearly every model card.

Known limitations

Saturated: top scores sit too close to the ceiling to separate frontier models. Use MATH or AIME for any meaningful discrimination.
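
To give a sense of why saturation hurts discrimination, the sketch below estimates the sampling noise in a score near the ceiling. It assumes the commonly used 1,319-problem test split and treats each problem as an independent pass/fail trial, which is a simplification; `accuracy_standard_error` is a hypothetical helper, not part of any evaluation harness.

```python
import math


def accuracy_standard_error(score_percent, n_problems):
    """Binomial standard error of an accuracy score, in percentage points."""
    p = score_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_problems)


N_TEST = 1319  # assumption: size of the commonly used GSM8K test split

se = accuracy_standard_error(99.0, N_TEST)
# Treating two models' errors as independent (a simplification), the
# standard error of the gap between their scores is sqrt(2) times larger.
gap_se = math.sqrt(2.0) * se

print(f"standard error at 99%: {se:.2f} points")
print(f"standard error of a two-model gap: {gap_se:.2f} points")
```

Under these assumptions the noise on a single 99% score is roughly a quarter of a point, so gaps of a few tenths of a point between frontier models are hard to distinguish from chance.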
