Inference Index
Reasoning

GPQA

PhD-level multiple-choice science questions in biology, physics, and chemistry, written by domain experts and validated to be "Google-proof": hard for skilled non-experts to answer even with unrestricted web access.

Metric: Percentage
Max score: 100
Maintainer: NYU (Rein et al.)
Models scored: 6

Why it matters

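A harder benchmark than MMLU-Pro. In the original paper, skilled non-experts with unrestricted web access scored about 34%, while domain experts scored about 65%. Frontier models now beat the non-expert figure, which is genuinely striking.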

Known limitations

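Only 448 questions. With a sample this small, a single score carries a standard error of roughly 2 points at mid-range accuracies, so gaps of 2-3 points between models can be noise rather than a real difference.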
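
The size of that noise can be estimated with a binomial standard error. The sketch below is illustrative only: it assumes each of the 448 questions is an independent pass/fail trial, and the function name `accuracy_standard_error` is hypothetical, not part of any GPQA tooling.

```python
import math

N_QUESTIONS = 448  # size of the question set quoted above


def accuracy_standard_error(accuracy_pct: float, n_questions: int = N_QUESTIONS) -> float:
    """Return the binomial standard error of an accuracy score, in percentage points.

    Assumes every question is an independent trial with the same success
    probability, which is a simplification.
    """
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)


if __name__ == "__main__":
    for acc in (34.0, 50.0, 65.0):
        se = accuracy_standard_error(acc)
        # 1.96 standard errors gives an approximate 95% interval half-width.
        print(f"accuracy {acc:.0f}%: SE ~ {se:.1f} points, 95% interval ~ +/- {1.96 * se:.1f} points")
```

Under these assumptions the standard error peaks at 50% accuracy and shrinks toward either extreme, so a 2-3 point gap between two mid-range scores sits within about one standard error of each score.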

Model rankings
