Inference Index

Reasoning

ARC-AGI

Grid puzzles testing abstraction and analogical reasoning. Easy for humans, brutally hard for LLMs.

Metric

Percentage

Max score

100

Maintainer

François Chollet / ARC Prize

Models scored

16
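As a rough illustration of how a percentage score out of 100 can arise on grid puzzles, the sketch below assumes exact-match scoring: a puzzle counts as solved only if a predicted grid equals the target grid cell for cell, with no partial credit. The page itself only states the metric and its maximum, so the scoring rule, the number of attempts allowed per puzzle, the function names, and the toy grids here are all assumptions for illustration.

```python
from typing import List

# A grid is a list of rows; each cell is an integer colour code.
Grid = List[List[int]]


def solved(attempts: List[Grid], target: Grid) -> bool:
    """Assumed rule: a puzzle is solved if any attempt matches the target exactly."""
    return any(attempt == target for attempt in attempts)


def percentage_score(all_attempts: List[List[Grid]], targets: List[Grid]) -> float:
    """Share of puzzles solved, scaled to the 0-100 range."""
    if not targets:
        return 0.0
    hits = sum(solved(a, t) for a, t in zip(all_attempts, targets))
    return 100.0 * hits / len(targets)


# Hypothetical toy data: two puzzles, one attempt each.
targets = [
    [[1, 0], [0, 1]],
    [[2, 2], [2, 2]],
]
attempts = [
    [[[1, 0], [0, 1]]],  # exact match, counts as solved
    [[[2, 2], [2, 0]]],  # one cell wrong, counts as unsolved
]
score = percentage_score(attempts, targets)
print(score)
```

Under this all-or-nothing rule a near miss earns nothing, which, together with the small test set noted under the limitations, means a single puzzle can move the reported percentage noticeably.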

Why it matters

One of the best-known tests that frontier models still struggle with. A canary for whether a model has actually generalized rather than memorized.

Known limitations

Small test set and a narrow domain (visual grid patterns), so results don’t always translate to general capability.
