SWE-Bench

Name: Software Engineering Benchmark
Creator: Princeton NLP

Tests whether an AI system can resolve real GitHub issues from popular open-source projects. Widely treated as the gold standard for measuring coding-agent capability.

Metric: Percentage
Max score: 100
Maintainer: Princeton NLP

Why it matters

Not "can it write code" but "can it fix real bugs in real codebases." The closest thing we have to measuring software engineering ability.

Still Python-dominated, still GitHub-dominated. Frontier scores can also vary by 15+ percentage points depending on the agent harness wrapping the model.
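A rough sketch of how the percentage is computed, and how a harness gap shows up in it. In SWE-bench, an issue counts as resolved when the tests that were failing before the fix now pass (FAIL_TO_PASS) and the tests that were already passing still pass (PASS_TO_PASS). The class, function names, instance IDs, and results below are made up for illustration; they are not the official evaluation code or real scores.

```python
from dataclasses import dataclass


@dataclass
class InstanceResult:
    """Test outcome for one task after applying the model's patch (illustrative)."""
    instance_id: str        # hypothetical ID, not a real SWE-bench instance
    fail_to_pass_ok: bool   # previously failing tests now pass
    pass_to_pass_ok: bool   # previously passing tests still pass


def is_resolved(result: InstanceResult) -> bool:
    # Both conditions must hold: the bug is fixed and nothing else broke.
    return result.fail_to_pass_ok and result.pass_to_pass_ok


def resolved_percentage(results: list[InstanceResult]) -> float:
    # Score = resolved instances / all instances, on a 0-100 scale.
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Same model, same four hypothetical tasks, two hypothetical agent harnesses.
harness_a = [
    InstanceResult("task-1", True, True),
    InstanceResult("task-2", True, False),   # fixed the bug but broke another test
    InstanceResult("task-3", False, True),   # patch did not fix the bug
    InstanceResult("task-4", False, True),
]
harness_b = [
    InstanceResult("task-1", True, True),
    InstanceResult("task-2", True, True),
    InstanceResult("task-3", False, True),
    InstanceResult("task-4", False, True),
]

score_a = resolved_percentage(harness_a)
score_b = resolved_percentage(harness_b)
print(f"harness A: {score_a:.1f}%")               # harness A: 25.0%
print(f"harness B: {score_b:.1f}%")               # harness B: 50.0%
print(f"gap: {score_b - score_a:.1f} points")     # gap: 25.0 points
```

With only four tasks a single flipped result moves the score by 25 points; on the full benchmark the same effect is smaller per task but still adds up, which is why scores are best compared within one harness.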