Inference Index

Benchmarks.

Not just scores — a guide to the evals themselves. What each one measures, who built it, and why the number matters.

Knowledge

MMLU-Pro

Tests expert-level knowledge and reasoning across 14 academic and professional disciplines. Compared with the original MMLU, the "Pro" version uses harder questions and ten answer choices instead of four to better differentiate top models.

Type
Percentage
Max
100
Maintainer
TIGER-Lab
Key
mmlu_pro
Coding

HumanEval

OpenAI’s 164 hand-written Python programming problems. The pass rate is the share of problems for which the model's generated code passes that problem's unit tests.

Type
Percentage
Max
100
Maintainer
OpenAI
Key
humaneval
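
A minimal sketch of how a pass rate of this kind could be computed, assuming each problem comes with assert-based unit tests. The helper names (passes_tests, pass_rate) and the two sample problems are hypothetical and are not taken from the HumanEval harness, which also sandboxes execution; this version runs code directly with exec and should only be used on trusted input.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate runs and every assertion in test_src holds."""
    namespace = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assert-based unit tests
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a fail.
        return False
    return True


def pass_rate(samples) -> float:
    """samples: list of (candidate_src, test_src) pairs, one per problem."""
    passed = sum(passes_tests(c, t) for c, t in samples)
    return 100.0 * passed / len(samples)


# Two hypothetical problems: one correct completion, one buggy completion.
samples = [
    ("def add(a, b):\n    return a + b\n", "assert add(2, 3) == 5\n"),
    ("def is_even(n):\n    return n % 2 == 1\n", "assert is_even(4)\n"),
]
print(pass_rate(samples))  # 50.0
```

With one generated sample per problem, this corresponds to what is usually reported as pass@1.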
Coding

SWE-Bench

Tests whether an AI can resolve real GitHub issues from popular open-source Python projects; a fix counts only if the repository's tests pass. Widely used as a reference measure of coding-agent capability.

Type
Percentage
Max
100
Maintainer
Princeton NLP
Key
swe_bench
Arena

LMSYS Chatbot Arena

Head-to-head blind comparisons by real users: a user chats with two anonymous models and votes for the better response. The closest thing to a "real-world" ranking, because models are judged on actual conversations.

Type
Elo
Max
No fixed maximum
Maintainer
LMSYS / UC Berkeley
Key
lmsys_elo
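
To make the Elo score type concrete, here is a sketch of the classic Elo update after a single head-to-head vote. The K-factor of 32 and the 400-point scale are the textbook defaults, used here as assumptions; the function names are hypothetical, and the leaderboard's own rating computation may differ from this simple online update.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one comparison.

    score_a is 1.0 if A won the vote, 0.0 if B won, 0.5 for a tie.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated models; the user votes for model A.
a, b = update_elo(1000.0, 1000.0, score_a=1.0)
print(a, b)  # 1016.0 984.0
```

Because each update only moves points from one model to the other, ratings are relative: the scale has no natural ceiling, and a score is meaningful only in comparison with other models on the same leaderboard.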
Math

MATH

Competition-level math problems from AMC/AIME/Olympiad datasets. Tests symbolic reasoning and multi-step solutions.

Type
Percentage
Max
100
Maintainer
UC Berkeley (Hendrycks et al.)
Key
math
Reasoning

GPQA

PhD-level science questions in biology, physics, and chemistry, written by domain experts and validated to be "Google-proof": skilled non-experts with web access still answer most of them incorrectly.

Type
Percentage
Max
100
Maintainer
NYU (Rein et al.)
Key
gpqa
Multimodal

MMMU

College-level questions that require visual reasoning across diagrams, charts, and scientific figures.

Type
Percentage
Max
100
Maintainer
IN.AI Research
Key
mmmu
Instruction

IFEval

Google’s test of whether a model follows verifiable instructions — e.g., "write exactly 3 paragraphs, each starting with the letter B."

Type
Percentage
Max
100
Maintainer
Google DeepMind
Key
ifeval
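
"Verifiable" here means an instruction can be checked by a program rather than by a human or model judge. A sketch of such a check for the example instruction above, assuming paragraphs are separated by blank lines; the function name and the sample responses are hypothetical, and this is not IFEval's own checker.

```python
def follows_instruction(response: str, n_paragraphs: int = 3, first_letter: str = "B") -> bool:
    """Check: exactly n_paragraphs paragraphs, each starting with first_letter."""
    # Assumption: a paragraph is a non-empty chunk separated by a blank line.
    paragraphs = [p.strip() for p in response.split("\n\n") if p.strip()]
    if len(paragraphs) != n_paragraphs:
        return False
    return all(p.startswith(first_letter) for p in paragraphs)


good = "Bees pollinate crops.\n\nBirds eat insects.\n\nBats hunt at night."
bad = "Bees pollinate crops.\n\nAnts farm aphids.\n\nBats hunt at night."
print(follows_instruction(good), follows_instruction(bad))  # True False
```

A benchmark score is then the percentage of prompts whose responses pass all of their checks.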
Reasoning

ARC-AGI

Grid puzzles testing abstraction and analogical reasoning. Easy for humans, brutally hard for LLMs.

Type
Percentage
Max
100
Maintainer
François Chollet / ARC Prize
Key
math
Math

GSM8K

About 8,500 grade-school math word problems. Top models now score near the ceiling, so the benchmark is largely saturated, but it is still used as a smoke test for arithmetic reasoning.

Type
Percentage
Max
100
Maintainer
OpenAI
Key
math
Reasoning

BIG-Bench Hard

A 23-task subset of BIG-Bench where LLMs historically struggled. Covers logical deduction, navigation, and causal reasoning.

Type
Percentage
Max
100
Maintainer
Google
Key
gpqa
Safety

TruthfulQA

817 questions designed to trap models into repeating common misconceptions.

Type
Percentage
Max
100
Maintainer
Oxford
Key
ifeval
Safety

HarmBench

Standardized evaluation of model refusal behavior across 510 unique harmful scenarios.

Type
Percentage
Max
100
Maintainer
Center for AI Safety
Key
ifeval
Arena

WildBench

Evaluates models on challenging real-world user queries drawn from chatbot conversation logs, with responses graded pairwise by an LLM judge.

Type
Elo
Max
No fixed maximum
Maintainer
Allen Institute for AI
Key
lmsys_elo
Reasoning

LiveBench

A benchmark designed to resist test-set contamination by refreshing its questions monthly, so models are less likely to have seen them during training.

Type
Percentage
Max
100
Maintainer
Abacus.AI
Key
mmlu_pro