LiveBench
Category
Reasoning
Contamination-free benchmark that refreshes its questions monthly.
Metric
Percentage
Max score
100
Maintainer
Abacus.AI
Models scored
22
Why it matters
The questions are refreshed monthly, so training on a past test set offers little lasting advantage. That makes this one of the most trustworthy rankings available today.
Model rankings
- 01 · 93.8% · o4
- 02 · 92.8% · Claude Opus 4.7
- 03 · 91.2% · GPT-5 Turbo
- 04 · 90.4% · Gemini 2.5 Ultra
- 05 · 88.7% · Claude Sonnet 4.6
- 06 · 87.8% · DeepSeek R2
- 07 · 87.1% · Llama 4 Behemoth
- 08 · 86.2% · Grok 4
- 09 · 85.4% · DeepSeek V4
- 10 · 84.0% · Qwen 3 Max
- 11 · 83.5% · Mistral X-Large
- 12 · 82.5% · Nova Pro
- 13 · 82.1% · Gemini 2.5 Flash
- 14 · 81.4% · Claude Haiku 4.5
- 15 · 80.1% · Command A
- 16 · 79.2% · Sonar Large
- 17 · 78.5% · GPT-5 Mini
- 18 · 78.5% · Inflection 3
- 19 · 77.2% · Llama 4 Scout
- 20 · 76.8% · Reka Flash 3
- 21 · 75.5% · Jamba 2
- 22 · 74.0% · Phi-5