Inference Index

LMSYS Chatbot Arena

Head-to-head blind comparisons by real users. It is the closest thing to a "real-world" ranking, because models are judged on actual conversations.

Metric: Elo
Max score: none (Elo ratings have no fixed upper bound)
Maintainer: LMSYS / UC Berkeley
Models scored: 13

Why it matters

The least gameable benchmark. Humans vote anonymously on pairwise matchups, without knowing which model wrote which answer. If users consistently prefer model A over model B, that’s hard to fake.
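
To make the pairwise-vote idea concrete, here is a minimal sketch of how a textbook Elo rating could be updated from individual votes. It assumes the standard Elo formula with a scale of 400 and a K-factor of 32; both constants, the starting rating of 1000, and the names `expected_score`, `update_elo`, `model_a`, and `model_b` are illustrative assumptions, and the Arena's actual rating procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo model (scale 400)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, outcome_a: float, k: float = 32.0):
    """Return new (rating_a, rating_b) after one vote.

    outcome_a is 1.0 if the voter preferred A, 0.0 if they preferred B,
    and 0.5 for a tie. k=32 is an assumed K-factor, not the Arena's value.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the total rating is conserved.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical models and votes, for illustration only.
    ratings = {"model_a": 1000.0, "model_b": 1000.0}
    votes = [1.0, 1.0, 0.0, 0.5]  # outcomes from model_a's point of view
    for outcome in votes:
        ratings["model_a"], ratings["model_b"] = update_elo(
            ratings["model_a"], ratings["model_b"], outcome
        )
    print(ratings)
```

Because each update depends only on the gap between the two ratings, an upset win over a much higher-rated model moves the scores more than an expected win does.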

Known limitations

Measures preference, not capability. Charm, verbosity, and confidence can inflate scores; deep reasoning doesn’t always help.

Model rankings
