GSM8K (Math)
8.5K grade-school word problems. Saturated now, but still used as a smoke test for arithmetic reasoning.
Metric: Percentage
Max score: 100
Maintainer: OpenAI
Models scored: 16
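The percentage metric is typically computed as exact-match accuracy on the final numeric answer. The sketch below is one plausible way to score it, not this directory's actual harness. It assumes reference solutions end with a `#### <number>` marker, as in the published GSM8K data, and that the model's answer is the last number in its output (a common convention, not something stated on this page). The function names `last_number`, `reference_answer`, and `accuracy_percent` are hypothetical.

```python
import re

# Matches integers and decimals, optionally negative, with optional
# thousands separators (e.g. "72", "-3.5", "1,000").
_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a float, or None if absent."""
    matches = _NUM.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def reference_answer(solution):
    """Parse the number after the '####' marker of a reference solution.

    Assumption: the reference follows the '#### <number>' convention.
    """
    return last_number(solution.split("####")[-1])


def accuracy_percent(predictions, references):
    """Exact-match accuracy on final answers, on a 0 to 100 scale."""
    if not references or len(predictions) != len(references):
        raise ValueError("need equally sized, non-empty inputs")
    correct = 0
    for pred, ref in zip(predictions, references):
        p, r = last_number(pred), reference_answer(ref)
        # An output with no number at all counts as wrong.
        if p is not None and r is not None and p == r:
            correct += 1
    return 100.0 * correct / len(references)
```

Taking the last number in the output is a deliberately crude extraction rule. Real harnesses often require a fixed answer format, which can shift scores by a point or more, so small gaps between models are only comparable under the same extraction rule.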
Why it matters
For years, GSM8K was the default test of multi-step arithmetic reasoning. It is now saturated on frontier models (99%+), but it still appears in nearly every model card.
Known limitations
Saturated. Use MATH or AIME for any meaningful discrimination.
Model rankings
- 01 96.2% o4
- 02 94.1% DeepSeek R2
- 03 92.0% Gemini 2.5 Ultra
- 04 91.5% GPT-5 Turbo
- 05 89.3% Claude Opus 4.7
- 06 88.7% DeepSeek V4
- 07 86.5% Grok 4
- 08 85.1% Qwen 3 Max
- 09 85.0% Claude Sonnet 4.6
- 10 84.2% Gemini 2.5 Flash
- 11 82.0% Phi-5
- 12 81.5% Llama 4 Behemoth
- 13 79.8% GPT-5 Mini
- 14 78.2% Mistral X-Large
- 15 76.8% Claude Haiku 4.5
- 16 72.5% Llama 4 Scout