Reasoning
BIG-Bench Hard
A 23-task subset of BIG-Bench on which earlier language models failed to outperform the average human rater. Tasks include logical deduction, navigation, and causal judgment.
Metric: Percentage
Max score: 100
Maintainer: Google
Models scored: 6
Why it matters
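Targets the specific tasks where LLMs have fallen short of human performance. The 23 tasks span different skills, so the benchmark gives a diverse test of reasoning abilities that tend to be brittle in language models.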
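
How the percentage is aggregated is not stated on this page. The sketch below shows one plausible reading, assuming exact-match scoring per example and an unweighted mean over tasks. The function name `bbh_score`, the input layout, and the task labels in the usage example are illustrative assumptions, not the maintainer's evaluation code.

```python
from collections import defaultdict


def bbh_score(results):
    """Return (overall_percentage, per_task_percentages).

    `results` is an iterable of (task_name, prediction, target) tuples.
    Assumption: an answer counts as correct on a case-insensitive exact
    match after trimming whitespace, and the overall score is the
    unweighted mean of per-task accuracies, on a 0-100 scale.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, prediction, target in results:
        total[task] += 1
        if prediction.strip().lower() == target.strip().lower():
            correct[task] += 1
    if not total:
        raise ValueError("no results to score")
    per_task = {task: 100.0 * correct[task] / total[task] for task in total}
    overall = sum(per_task.values()) / len(per_task)
    return overall, per_task


# Illustrative data only; the task labels are placeholders.
example = [
    ("navigate", "Yes", "Yes"),
    ("navigate", "No", "Yes"),
    ("logical_deduction", "(A)", "(A)"),
]
overall, per_task = bbh_score(example)
print(overall)   # 75.0
print(per_task)  # {'navigate': 50.0, 'logical_deduction': 100.0}
```

Averaging per task rather than per example keeps a task with many examples from dominating the score. A per-example average would be an equally reasonable choice and can give a different number.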