Inference Index

Benchmarks.

Not just scores — a guide to the evals themselves. What each one measures, who built it, and why the number matters.

15 tracked

MMLU-Pro

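Tests expert-level knowledge across 14 subject areas, including math, physics, law, and health (the original MMLU spans 57 subjects). The "Pro" version uses harder, more reasoning-heavy questions with up to ten answer choices instead of four, which cuts the random-guess baseline from 25% to roughly 10% and better differentiates top models.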

HumanEval

OpenAI’s 164 hand-written Python programming problems. Scores are reported as pass@k: the probability that at least one of k generated samples passes the problem's unit tests.
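
The pass-rate metric is usually computed per problem and then averaged. Below is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper, assuming n samples were generated for a problem and c of them passed its unit tests. The function name and argument names are illustrative, not taken from any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed all unit tests
    k: sample budget being scored (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed n")
    # If fewer than k samples failed, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical numbers: 10 samples, 3 of which passed.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score is then the mean of this value over all 164 problems.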

SWE-Bench

Tests whether an AI can resolve real GitHub issues from popular open-source Python projects; a fix counts only if the repository's tests pass. Widely treated as the gold standard for measuring coding agent capability.

LMSYS Chatbot Arena

Head-to-head blind comparisons by real users, aggregated into a leaderboard. The closest thing to a "real-world" ranking: models are judged on actual conversations.

LMSYS / UC Berkeley
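
To show how pairwise votes can become a ranking, here is a minimal sketch of an online Elo update. The Arena team has described Elo-style and Bradley-Terry ratings for its leaderboard; this is plain textbook Elo, not the leaderboard's actual pipeline. The K-factor of 32 and the starting rating of 1000 are assumptions chosen for illustration.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, outcome: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one comparison.

    outcome: 1.0 if A won, 0.0 if B won, 0.5 for a tie.
    k: step size (assumed value, not the Arena's setting).
    """
    delta = k * (outcome - expected_score(r_a, r_b))
    # Ratings are zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta


if __name__ == "__main__":
    # Two hypothetical models start level; a user prefers model A.
    a, b = update_elo(1000.0, 1000.0, outcome=1.0)
    print(a, b)
```

An upset (a low-rated model beating a high-rated one) moves both ratings further than an expected win does, which is why a few surprising votes can shift a ranking quickly.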

MATH

12,500 competition-level math problems drawn from contests such as AMC 10, AMC 12, and AIME. Tests symbolic reasoning and multi-step solutions.

UC Berkeley (Hendrycks et al.)

GPQA

PhD-level science questions in biology, physics, and chemistry, written by domain experts and validated to be "Google-proof": skilled non-experts with unrestricted web access still get most of them wrong.

NYU (Rein et al.)

MMMU

College-level questions that require visual reasoning across diagrams, charts, and scientific figures.

IFEval

Google’s test of whether a model follows verifiable instructions, meaning constraints a program can check without a human or model judge, e.g., "write exactly 3 paragraphs, each starting with the letter B."

Google DeepMind
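
"Verifiable" here means the instruction can be checked with ordinary code. As a sketch, the example instruction above could be verified as follows. The function name and the rule that paragraphs are separated by blank lines are assumptions made for illustration; this is not IFEval's own checker.

```python
import re


def follows_instruction(response: str, n_paragraphs: int = 3,
                        first_letter: str = "B") -> bool:
    """Check: exactly n_paragraphs paragraphs, each starting with first_letter."""
    # Assumption: paragraphs are separated by one or more blank lines.
    parts = re.split(r"\n\s*\n", response.strip())
    paragraphs = [p.strip() for p in parts if p.strip()]
    if len(paragraphs) != n_paragraphs:
        return False
    # Case-insensitive comparison on the first character of each paragraph.
    return all(p.upper().startswith(first_letter.upper()) for p in paragraphs)


if __name__ == "__main__":
    good = "Birds sing.\n\nBees hum.\n\nBats sleep."
    bad = "Birds sing.\n\nCats nap.\n\nBats sleep."
    print(follows_instruction(good))
    print(follows_instruction(bad))
```

Because the check is deterministic, two runs on the same response always agree, which is the property that separates this style of eval from judge-based grading.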

ARC-AGI

Grid puzzles testing abstraction and analogical reasoning. Easy for humans, brutally hard for LLMs.

François Chollet / ARC Prize

GSM8K

8.5K grade-school word problems. Saturated now, but still used as a smoke test for arithmetic reasoning.

BIG-Bench Hard

A 23-task subset of BIG-Bench where LLMs historically struggled. Covers logical deduction, navigation, and causal reasoning.

TruthfulQA

817 questions designed to trap models into repeating common misconceptions.

HarmBench

Standardized evaluation of model refusal behavior across 510 unique harmful scenarios.

Center for AI Safety

WildBench

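Evaluates models on challenging real-world user queries selected from logged chatbot conversations. Responses are graded by an LLM judge, both pairwise against baseline models and individually.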

Allen Institute for AI

LiveBench

Designed to limit test-set contamination: questions are refreshed monthly from recent sources and scored automatically against objective ground-truth answers.
