Inference Index

SWE-bench

Tests whether an AI can resolve real GitHub issues from popular open-source projects, with each proposed fix checked against the project's own tests. Widely treated as the reference benchmark for coding agent capability.

Metric: Percentage
Max score: 100
Maintainer: Princeton NLP
Models scored: 8
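
As a rough illustration of how a percentage score out of 100 can be derived, the sketch below computes the share of task instances a model resolved. The function name, the instance IDs, and the outcomes are hypothetical placeholders, not real results or part of any official tooling; it assumes each instance has already been judged resolved or not by running the project's tests.

```python
def percent_resolved(outcomes: dict[str, bool]) -> float:
    """Return the share of task instances marked resolved, on a 0-100 scale."""
    if not outcomes:
        raise ValueError("no task instances to score")
    resolved = sum(1 for ok in outcomes.values() if ok)
    return 100.0 * resolved / len(outcomes)


# Hypothetical instance IDs and outcomes, for illustration only.
example = {
    "project-a__issue-1": True,
    "project-a__issue-2": False,
    "project-b__issue-7": False,
    "project-c__issue-3": False,
}

print(percent_resolved(example))  # 1 of 4 resolved -> 25.0
```

Because the score is a plain ratio over a fixed task set, two runs are only comparable when they cover the same instances.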

Why it matters

Not "can it write code" but "can it fix real bugs in real codebases." The closest thing we have to measuring software engineering ability.

Known limitations

Still dominated by Python and by GitHub-hosted projects. Scores for frontier models can also vary by 15+ percentage points depending on the agent harness wrapping the model.
