Reasoning
GPQA
PhD-level science questions in biology, physics, and chemistry — written by domain experts, validated to be Google-proof.
Metric: Percentage
Max score: 100
Maintainer: NYU (Rein et al.)
Models scored: 6
Why it matters
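A harder ceiling than MMLU-Pro. In the original paper, skilled non-experts with unrestricted web access scored about 34%, against a 25% chance baseline on four-option questions, while domain experts scored about 65%. Frontier models now beat the non-expert figure, which is genuinely striking.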
Known limitations
Only 448 questions. With a sample that small, the standard error on a score is roughly 2-3 percentage points, so gaps of that size between models may be noise.
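The noise estimate above can be checked with a short calculation. This is a minimal sketch that assumes each question is an independent trial with the same chance of being answered correctly (a binomial model). The function name `accuracy_standard_error` and the example accuracies are illustrative, not taken from the benchmark's own tooling.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy estimate, as a fraction.

    Assumes independent questions with equal difficulty, which is a
    simplification for any real benchmark.
    """
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


N_QUESTIONS = 448  # question count stated in the limitations note

# Illustrative accuracies only; these are not reported model scores.
for acc in (0.34, 0.50, 0.70):
    se = accuracy_standard_error(acc, N_QUESTIONS)
    # 1.96 standard errors gives an approximate 95% interval half-width.
    print(
        f"accuracy={acc:.0%}  "
        f"standard error={se * 100:.1f} points  "
        f"95% interval=+/-{1.96 * se * 100:.1f} points"
    )
```

Under these assumptions the standard error stays a little above 2 points across the range shown, and the approximate 95% interval is about twice that, so a difference of a few points between two models is hard to distinguish from sampling noise.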