Coding · HumanEval
A set of 164 hand-written Python programming problems from OpenAI. The pass rate is the share of problems for which the model's generated code passes that problem's unit tests.
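As a rough illustration of how such a score can be computed, the sketch below shows a plain pass rate over per-problem results, plus the unbiased pass@k estimator described in the original HumanEval paper. This page does not say whether its scores are pass@1 or how many samples were drawn per problem, so the function names and the sampling setup here are assumptions for illustration only.

```python
from math import comb


def pass_rate(results: list[bool]) -> float:
    """Percentage of problems whose generated solution passed all unit tests.

    `results` holds one boolean per problem (hypothetical input format).
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(results) / len(results)


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: samples generated for the problem
    c: samples that passed the unit tests
    k: sample budget being scored
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# One sample per problem: three of four problems solved.
print(pass_rate([True, True, False, True]))  # 75.0
```

With k = 1 the estimator reduces to c / n, so averaging it over all problems gives the same figure as the plain pass rate when one sample is drawn per problem.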
Metric: Percentage
Max score: 100
Maintainer: OpenAI
Models scored: 22
Why it matters
The oldest widely cited coding benchmark. It is still useful as a sanity check, but it is saturated: the top seven models on this leaderboard score 90% or higher, and all 22 fall within an 18-point range.
Known limitations
Small (164 problems), Python-only, and publicly available, so contamination of training data is a serious concern. Prefer SWE-Bench for assessing real-world software engineering work.
Model rankings
- 01 · 94.4% · Claude Opus 4.7
- 02 · 93.0% · GPT-5 Turbo
- 03 · 92.3% · o4
- 04 · 92.1% · Claude Sonnet 4.6
- 05 · 91.5% · DeepSeek V4
- 06 · 90.2% · Gemini 2.5 Ultra
- 07 · 90.1% · DeepSeek R2
- 08 · 89.3% · Llama 4 Behemoth
- 09 · 88.4% · Grok 4
- 10 · 88.0% · Mistral X-Large
- 11 · 87.2% · Qwen 3 Max
- 12 · 87.0% · Claude Haiku 4.5
- 13 · 85.8% · Gemini 2.5 Flash
- 14 · 85.5% · GPT-5 Mini
- 15 · 84.0% · Nova Pro
- 16 · 82.5% · Command A
- 17 · 82.4% · Llama 4 Scout
- 18 · 80.5% · Phi-5
- 19 · 79.8% · Inflection 3
- 20 · 78.5% · Sonar Large
- 21 · 78.0% · Reka Flash 3
- 22 · 77.0% · Jamba 2