Inference Index
Reasoning

ARC-AGI

Grid puzzles testing abstraction and analogical reasoning. Easy for humans, brutally hard for LLMs.

Metric: Percentage
Max score: 100
Maintainer: François Chollet / ARC Prize
Models scored: 16
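
As a rough illustration of what a percentage score on grid puzzles means, the sketch below counts a puzzle as solved only when the predicted grid matches the expected grid cell for cell, then reports the solved share out of 100. The names Grid, is_solved, and percentage_score are hypothetical, and the sketch assumes one attempt per puzzle; the official ARC Prize scoring rules may differ in detail.

```python
from typing import List

# Hypothetical type alias: a grid is a list of rows, each cell an integer color code.
Grid = List[List[int]]


def is_solved(predicted: Grid, expected: Grid) -> bool:
    """Return True only if the grids match exactly (same shape, same cells)."""
    return predicted == expected


def percentage_score(predictions: List[Grid], targets: List[Grid]) -> float:
    """Share of puzzles solved exactly, on a 0-100 scale.

    Assumes one prediction per puzzle (an assumption of this sketch).
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    solved = sum(is_solved(p, t) for p, t in zip(predictions, targets))
    return 100.0 * solved / len(targets)


# Two toy puzzles: the first prediction is exact, the second has one wrong cell.
targets = [[[1, 0], [0, 1]], [[2, 2], [2, 2]]]
predictions = [[[1, 0], [0, 1]], [[2, 2], [2, 0]]]
score = percentage_score(predictions, targets)
print(score)  # 50.0
```

Exact matching gives no partial credit, so a single wrong cell counts the same as a completely wrong grid.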

Why it matters

One of the best-known tests that frontier models still struggle with. A canary for whether a model has actually generalized beyond its training data.

Known limitations

Small test set, narrow domain (visual grid patterns). Results don’t always translate to general capability.
