Inference Index
Reasoning

BIG-Bench Hard

A 23-task subset of BIG-Bench on which earlier language models failed to outperform the average human rater. The tasks include logical deduction, navigation, and causal reasoning.

Metric: Percentage
Max score: 100
Maintainer: Google
Models scored: 6

Why it matters

Targets the tasks where LLMs specifically fail, which makes it a diverse test of brittle reasoning skills.
