HumanEval

Name: HumanEval Code Generation Benchmark
Creator: OpenAI

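A set of 164 hand-written Python programming problems from OpenAI, each with a function signature, a docstring, and unit tests. The pass rate is the percentage of problems for which the model’s generated code passes all of that problem’s unit tests.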
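As a rough illustration of how such a pass rate can be computed, the sketch below runs each generated completion against its problem's tests and reports the percentage that pass. The two toy problems and the helper names `passes_tests` and `pass_rate` are hypothetical and are not taken from the official evaluation harness. The sketch also calls `exec` without any sandboxing or timeout, which a real harness would need.

```python
from typing import List


def passes_tests(completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if the completion runs and its tests raise no exception."""
    namespace: dict = {}
    try:
        exec(completion, namespace)   # defines the candidate function
        exec(test_code, namespace)    # defines check(candidate)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        # Any failure (syntax error, missing function, failed assert) counts as a miss.
        return False


def pass_rate(results: List[bool]) -> float:
    """Percentage of problems solved, on a 0-100 scale."""
    return 100.0 * sum(results) / len(results) if results else 0.0


# Hypothetical toy problems; the second completion is deliberately buggy.
problems = [
    {
        "entry_point": "add",
        "completion": "def add(a, b):\n    return a + b\n",
        "test": "def check(candidate):\n    assert candidate(2, 3) == 5\n",
    },
    {
        "entry_point": "is_even",
        "completion": "def is_even(n):\n    return n % 2 == 1\n",
        "test": (
            "def check(candidate):\n"
            "    assert candidate(4) is True\n"
            "    assert candidate(3) is False\n"
        ),
    },
]

results = [
    passes_tests(p["completion"], p["test"], p["entry_point"]) for p in problems
]
score = pass_rate(results)
print(f"{score:.1f}%")  # prints 50.0%
```

One completion passes and one fails, so the toy score is 50%. On the real benchmark the same ratio is taken over all 164 problems.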

Metric

Percentage

Max score

100

Maintainer

OpenAI

Why it matters

One of the oldest widely cited coding benchmarks. Still useful as a sanity check, but saturated: frontier models now generally score in the 90s.

Small (164 problems), Python-only, and publicly available, so training-data contamination is a serious concern. Prefer SWE-bench for evaluating realistic software engineering work.