Inference Index

Reasoning

BIG-Bench Hard

A subset of 23 BIG-Bench tasks on which earlier language models did not outperform the average human rater. Covers logical deduction, navigation, and causal reasoning.

Metric: Percentage
Max score: 100
Maintainer: Google
Models scored: 6

Why it matters

Targets the tasks where LLMs specifically fail, making it a diverse test of brittle reasoning skills.
