LMSYS Chatbot Arena
Head-to-head blind comparisons by real users. The closest thing to a "real-world" ranking, because models are judged on actual conversations rather than a fixed test set.
Metric: Elo
Max score: none (Elo ratings are relative and have no fixed maximum)
Maintainer: LMSYS / UC Berkeley
Models scored: 13
Why it matters
Among the least gameable benchmarks. Humans vote on pairwise matchups without knowing which model wrote which response. If users consistently prefer model A over model B, that's hard to fake.
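To make the link between pairwise votes and an Elo score concrete, here is a minimal sketch of the textbook Elo update applied to a list of votes. It is an illustration only: the function names, the starting rating of 1000, and the K-factor of 32 are assumptions chosen for the example, and the leaderboard's actual rating procedure may differ.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a, rating_b, outcome_a, k=32.0):
    """Return new (rating_a, rating_b) after one vote.

    outcome_a is 1.0 if A was preferred, 0.0 if B was, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rate_models(votes, initial=1000.0, k=32.0):
    """Process (model_a, model_b, outcome_a) votes in order.

    Models start at `initial` (assumed value) the first time they appear.
    """
    ratings = {}
    for model_a, model_b, outcome_a in votes:
        ra = ratings.setdefault(model_a, initial)
        rb = ratings.setdefault(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(ra, rb, outcome_a, k)
    return ratings


if __name__ == "__main__":
    # Hypothetical votes between two placeholder models.
    votes = [("model-a", "model-b", 1.0), ("model-a", "model-b", 0.5)]
    print(rate_models(votes))
```

Each vote shifts points from one model to the other, so the total stays constant and only relative differences carry meaning. An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected result does.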
Known limitations
Measures preference, not capability. Charm, verbosity, and confidence can inflate scores; deep reasoning doesn’t always help.