Inference Index
Coding

HumanEval

OpenAI’s 164 hand-written Python programming problems. The pass rate is the percentage of problems for which the model’s generated code passes that problem’s unit tests.
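As a rough illustration of how such a pass rate can be computed, the sketch below uses the pass@k estimator commonly reported alongside HumanEval: for a problem with n sampled completions of which c pass the tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k). This page does not say which variant its scores use, so treating the score as pass@k (pass@1 by default) is an assumption, and the names `pass_at_k` and `benchmark_score` are hypothetical helpers, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: completions sampled for the problem
    c: completions that passed every unit test
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples
    # contains no passing completion is C(n - c, k) / C(n, k).
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int = 1) -> float:
    """Mean pass@k over all problems, as a percentage (0 to 100).

    results holds one (n, c) pair per problem. This aggregation is an
    assumption about how a single leaderboard number is produced.
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Four hypothetical problems, one sample each, three of which passed.
    print(benchmark_score([(1, 1), (1, 0), (1, 1), (1, 1)]))
```

With a single sample per problem (n = 1, k = 1) this reduces to the plain fraction of problems solved, which matches the description of the metric above.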

Metric: Percentage
Max score: 100
Maintainer: OpenAI
Models scored: 22

Why it matters

HumanEval is the oldest widely cited coding benchmark. It is still useful as a sanity check, but it is saturated: all frontier models now score in the 90s, so it no longer separates them.

Known limitations

The benchmark is small (164 problems), Python-only, and publicly available, so contamination of training data is a serious concern. For evaluating real-world software engineering work, prefer SWE-bench.
