Inference Index


Math

GSM8K

About 8.5K grade-school math word problems. Saturated on frontier models, but still used as a smoke test for arithmetic reasoning.

Metric: Percentage
Max score: 100
Maintainer: OpenAI
Models scored: 16

Why it matters

For years, GSM8K was the standard test of multi-step arithmetic reasoning. Frontier models now saturate it (99%+), but it still appears in nearly every model card.

Known limitations

Saturated: frontier scores cluster near the ceiling, so small differences between models carry little signal. Use MATH or AIME for any meaningful discrimination.

Model rankings
