ARC-AGI (Reasoning)
Grid puzzles testing abstraction and analogical reasoning. Easy for humans, brutally hard for LLMs.
Metric: Percentage
Max score: 100
Maintainer: François Chollet / ARC Prize
Models scored: 16
Why it matters
The best-known test that frontier models still struggle with. A canary for whether a model has actually generalized.
Known limitations
Small test set, narrow domain (visual grid patterns). Results don’t always translate to general capability.
Model rankings
- 01 · 96.2% · o4
- 02 · 94.1% · DeepSeek R2
- 03 · 92.0% · Gemini 2.5 Ultra
- 04 · 91.5% · GPT-5 Turbo
- 05 · 89.3% · Claude Opus 4.7
- 06 · 88.7% · DeepSeek V4
- 07 · 86.5% · Grok 4
- 08 · 85.1% · Qwen 3 Max
- 09 · 85.0% · Claude Sonnet 4.6
- 10 · 84.2% · Gemini 2.5 Flash
- 11 · 82.0% · Phi-5
- 12 · 81.5% · Llama 4 Behemoth
- 13 · 79.8% · GPT-5 Mini
- 14 · 78.2% · Mistral X-Large
- 15 · 76.8% · Claude Haiku 4.5
- 16 · 72.5% · Llama 4 Scout