Knowledge · MMLU-Pro
Tests expert-level knowledge across 14 academic and professional domains (the original MMLU covers 57 subjects). The "Pro" version uses harder, more reasoning-focused questions with up to ten answer choices instead of four, to better differentiate top models.
Metric: Percentage
Max score: 100
Maintainer: TIGER-Lab
Models scored: 22
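A percentage metric with a maximum of 100 is usually plain accuracy: the share of questions answered correctly. The sketch below illustrates that calculation, assuming exact-match grading of one answer letter per question. The function names and the toy data are hypothetical and are not taken from TIGER-Lab's evaluation code.

```python
def accuracy_percent(predictions, answers):
    """Return the share of exactly matching answer letters on a 0-100 scale."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def chance_percent(num_choices):
    """Expected score of uniform random guessing among num_choices options."""
    return 100.0 / num_choices


# Hypothetical toy data: four questions, three answered correctly.
score = accuracy_percent(["A", "C", "J", "B"], ["A", "C", "J", "D"])
print(f"{score:.1f}%")  # prints "75.0%"
```

Under the same assumptions, `chance_percent(10)` gives 10.0 and `chance_percent(4)` gives 25.0, which is one reason adding answer choices leaves more room to separate strong models from lucky guessing.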
Why it matters
If you need an AI that knows things — from medieval history to molecular biology — MMLU-Pro is the single best test of breadth.
Known limitations
Multiple-choice tests reward pattern matching. A high MMLU-Pro score doesn’t guarantee good reasoning when a question is framed in an unfamiliar way.
Model rankings
- 01 · 93.8% · o4
- 02 · 92.8% · Claude Opus 4.7
- 03 · 91.2% · GPT-5 Turbo
- 04 · 90.4% · Gemini 2.5 Ultra
- 05 · 88.7% · Claude Sonnet 4.6
- 06 · 87.8% · DeepSeek R2
- 07 · 87.1% · Llama 4 Behemoth
- 08 · 86.2% · Grok 4
- 09 · 85.4% · DeepSeek V4
- 10 · 84.0% · Qwen 3 Max
- 11 · 83.5% · Mistral X-Large
- 12 · 82.5% · Nova Pro
- 13 · 82.1% · Gemini 2.5 Flash
- 14 · 81.4% · Claude Haiku 4.5
- 15 · 80.1% · Command A
- 16 · 79.2% · Sonar Large
- 17 · 78.5% · GPT-5 Mini
- 18 · 78.5% · Inflection 3
- 19 · 77.2% · Llama 4 Scout
- 20 · 76.8% · Reka Flash 3
- 21 · 75.5% · Jamba 2
- 22 · 74.0% · Phi-5