Benchmarks.
Not just scores — a guide to the evals themselves. What each one measures, who built it, and why the number matters.
MMLU-Pro
Tests expert-level knowledge across 14 disciplines, consolidated from the 57 subjects of the original MMLU. The "Pro" version uses harder, more reasoning-heavy questions with ten answer choices instead of four to better differentiate top models.
HumanEval
OpenAI’s 164 hand-written Python programming problems. Pass rate measures whether the model generates code that passes unit tests.
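HumanEval pass rates are usually reported as pass@k: the chance that at least one of k sampled completions passes a problem's unit tests. The sketch below shows the unbiased estimator from the HumanEval paper, assuming n completions were sampled per problem and c of them passed. The function names and the `(n, c)` input layout are illustrative, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed the unit tests,
    k: sample budget being scored (1 <= k <= n).
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures means every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over per-problem (n, c) pairs (hypothetical layout)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to the plain fraction of passing samples, c / n.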
SWE-Bench
Tests whether an AI can resolve real GitHub issues from popular open-source Python projects, with a fix counted only if the repository's tests pass. Widely treated as the reference benchmark for coding agent capability.
LMSYS Chatbot Arena
Head-to-head blind comparisons by real users. The closest thing to a "real-world" ranking — models are judged on actual conversations.
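This description does not say how individual votes become a ranking. One standard way to aggregate head-to-head results is an Elo-style rating update, sketched below. The starting rating of 1000, the K-factor of 32, and the function names are assumptions for illustration, and ties are not handled. It is not a reproduction of the leaderboard's own method.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(
    ratings: dict[str, float],
    winner: str,
    loser: str,
    k: float = 32.0,        # assumed K-factor
    initial: float = 1000.0,  # assumed starting rating for unseen models
) -> dict[str, float]:
    """Apply one blind-comparison vote to the ratings table in place."""
    r_w = ratings.get(winner, initial)
    r_l = ratings.get(loser, initial)
    # The winner gains more when the win was less expected.
    delta = k * (1.0 - expected_score(r_w, r_l))
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta
    return ratings


# Usage: replay a list of (winner, loser) votes in order.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-c", "model-b")]
table: dict[str, float] = {}
for won, lost in votes:
    update_elo(table, won, lost)
```

Because each update moves the two ratings by equal and opposite amounts, the total rating across models stays constant; only the ordering and gaps change.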
MATH
Competition-level math problems from AMC/AIME/Olympiad datasets. Tests symbolic reasoning and multi-step solutions.
GPQA
PhD-level science questions in biology, physics, and chemistry — written by domain experts, validated to be Google-proof.
MMMU
College-level questions that require visual reasoning across diagrams, charts, and scientific figures.
IFEval
Google’s test of whether a model follows verifiable instructions — e.g., "write exactly 3 paragraphs, each starting with the letter B."
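"Verifiable" means a short program can check compliance without a human or model judge. A minimal sketch of a checker for the example instruction above follows. It assumes paragraphs are separated by blank lines and that the letter check is case-sensitive. The function names are hypothetical and this is not IFEval's own checker code.

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines (assumed paragraph delimiter), dropping empties."""
    return [p.strip() for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]


def follows_instruction(
    response: str, n_paragraphs: int = 3, first_letter: str = "B"
) -> bool:
    """True if the response has exactly n paragraphs, each starting with the letter."""
    paragraphs = split_paragraphs(response)
    return len(paragraphs) == n_paragraphs and all(
        p.startswith(first_letter) for p in paragraphs
    )


# Usage: the first response complies, the second breaks the letter rule.
good = "Bears sleep all winter.\n\nBirds fly south.\n\nBees stay in the hive."
bad = "Bears sleep all winter.\n\nCats do not.\n\nBees stay in the hive."
```

A benchmark score is then the fraction of prompts (or of individual instructions) for which checks like this return true.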
ARC-AGI
Grid puzzles testing abstraction and analogical reasoning. Easy for humans, brutally hard for LLMs.
GSM8K
8.5K grade-school math word problems. Largely saturated by frontier models, but still used as a smoke test for arithmetic reasoning.
BIG-Bench Hard
A 23-task subset of BIG-Bench on which earlier language models failed to beat the average human rater. Covers logical deduction, navigation, and causal reasoning.
TruthfulQA
817 questions designed to trap models into repeating common misconceptions.
HarmBench
Standardized evaluation of automated red-teaming attacks and model refusal behavior across 510 unique harmful behaviors.
WildBench
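Evaluates models on challenging real-world user queries selected from logged chatbot conversations, with responses graded by an LLM judge, including pairwise comparisons against baseline models.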
LiveBench
Benchmark designed to resist test-set contamination by refreshing its questions monthly.