
The leaderboard.

Every model we track, sortable by ELO, benchmarks, cost, and context. Live-seeded from public sources and updated as new scores land.
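The sorting described above can be sketched in a few lines. This is a minimal illustration, not the site's actual implementation: `Row`, `parse_context`, and `sort_rows` are hypothetical names, the four sample rows are copied from the listing on this page, and placing models without an ELO score last is an assumption that mirrors the order of the listing.

```python
from typing import Callable, List, NamedTuple, Optional


class Row(NamedTuple):
    model: str
    elo: Optional[int]   # None where the listing shows no ELO score
    input_usd: float     # listed input price
    context: str         # context window as written, e.g. "128K" or "1M"


def parse_context(text: str) -> int:
    """Convert '128K' / '2M' style strings to a token count (assumes K=1,000 and M=1,000,000)."""
    units = {"K": 1_000, "M": 1_000_000}
    return int(float(text[:-1]) * units[text[-1]])


def sort_rows(rows: List[Row], key: Callable[[Row], object], descending: bool = True) -> List[Row]:
    """Sort rows by key; rows whose key is None go last in either direction."""
    present = [r for r in rows if key(r) is not None]
    missing = [r for r in rows if key(r) is None]
    return sorted(present, key=key, reverse=descending) + missing


rows = [
    Row("Claude Opus 4.7", 1412, 15.00, "1M"),
    Row("DeepSeek V4", 1334, 0.14, "128K"),
    Row("o4", None, 15.00, "200K"),
    Row("Gemini 2.5 Ultra", 1385, 7.00, "2M"),
]

by_elo = [r.model for r in sort_rows(rows, key=lambda r: r.elo)]
by_context = [r.model for r in sort_rows(rows, key=lambda r: parse_context(r.context))]
cheapest = sort_rows(rows, key=lambda r: r.input_usd, descending=False)[0].model

print(by_elo)      # ['Claude Opus 4.7', 'Gemini 2.5 Ultra', 'DeepSeek V4', 'o4']
print(by_context)  # ['Gemini 2.5 Ultra', 'Claude Opus 4.7', 'o4', 'DeepSeek V4']
print(cheapest)    # DeepSeek V4
```

Keeping unscored rows in a separate list avoids comparing `None` with integers, which Python 3 rejects.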

22 tracked


# · Model · ELO · MMLU-Pro · HumanEval · SWE-Bench · Input $ (per 1M tokens) · Output $ (per 1M tokens) · Context

01
Claude Opus 4.7 · Anthropic
Anthropic’s frontier reasoning model. Leads SWE-Bench and holds top ranks on long-context coding and agentic workflows.
ELO 1412 · MMLU-Pro 92.8 · HumanEval 94.4 · SWE-Bench 72.1 · Input $15.00 · Output $75.00 · Context 1M
02
GPT-5 Turbo · OpenAI
OpenAI’s flagship unified model. Handles text, vision, and audio natively. The generalist benchmark champion.
ELO 1398 · MMLU-Pro 91.2 · HumanEval 93.0 · SWE-Bench 65.8 · Input $6.00 · Output $24.00 · Context 400K
03
Gemini 2.5 Ultra · Google DeepMind
2M-token context, native video understanding, and Google’s deepest multimodal stack. The long-context king.
ELO 1385 · MMLU-Pro 90.4 · HumanEval 90.2 · SWE-Bench 58.3 · Input $7.00 · Output $21.00 · Context 2M
04
Claude Sonnet 4.6 · Anthropic
The workhorse. Near-Opus quality at 1/5 the cost. The default choice for production code and agent workloads.
ELO 1368 · MMLU-Pro 88.7 · HumanEval 92.1 · SWE-Bench 64.5 · Input $3.00 · Output $15.00 · Context 500K
05
Grok 4 · xAI
xAI’s latest. Real-time X search integration, strong on current events and meme-literate tasks.
ELO 1358 · MMLU-Pro 86.2 · HumanEval 88.4 · SWE-Bench n/a · Input $5.00 · Output $15.00 · Context 256K
06
Llama 4 Behemoth · Meta · Open
Meta’s open-weights flagship. 405B params, fully open license, runs on every major inference provider.
ELO 1342 · MMLU-Pro 87.1 · HumanEval 89.3 · SWE-Bench 52.4 · Input $2.50 · Output $8.00 · Context 256K
07
DeepSeek V4 · DeepSeek · Open
The shock-the-market moment of 2026. Frontier coding quality at roughly 1/100th of Opus’s input price.
ELO 1334 · MMLU-Pro 85.4 · HumanEval 91.5 · SWE-Bench 54.1 · Input $0.14 · Output $0.28 · Context 128K
08
Mistral X-Large · Mistral · Open
European frontier model. EU-hosted inference, strong European language coverage, Apache-licensed weights.
ELO 1318 · MMLU-Pro 83.5 · HumanEval 88.0 · SWE-Bench n/a · Input $2.00 · Output $6.00 · Context 256K
09
Gemini 2.5 Flash · Google DeepMind
Google’s price/performance darling. 1M context at $0.30/M input — nobody comes close on throughput-per-dollar.
ELO 1312 · MMLU-Pro 82.1 · HumanEval 85.8 · SWE-Bench n/a · Input $0.30 · Output $1.20 · Context 1M
10
Claude Haiku 4.5 · Anthropic
Fast, cheap, and smart enough for most routing and extraction tasks. Great base model for subagents.
ELO 1305 · MMLU-Pro 81.4 · HumanEval 87.0 · SWE-Bench 49.2 · Input $0.80 · Output $4.00 · Context 200K
11
Qwen 3 Max · Qwen · Open
Alibaba’s flagship. Strongest non-English coverage of any open model; dominant in Asian markets.
ELO 1295 · MMLU-Pro 84.0 · HumanEval 87.2 · SWE-Bench n/a · Input $1.60 · Output $6.40 · Context 128K
12
GPT-5 Mini · OpenAI
GPT-5’s smaller, cheaper sibling. Optimized for chat and lightweight agent tasks.
ELO 1288 · MMLU-Pro 78.5 · HumanEval 85.5 · SWE-Bench n/a · Input $0.50 · Output $2.00 · Context 256K
13
Llama 4 Scout · Meta · Open
The 70B open-weight workhorse. Fits on a single H100, beloved by self-hosters.
ELO 1265 · MMLU-Pro 77.2 · HumanEval 82.4 · SWE-Bench n/a · Input $0.35 · Output $0.80 · Context 128K
14
o4 · OpenAI
Deep-reasoning model. Spends more tokens thinking to crush math, science, and hard coding problems.
ELO n/a · MMLU-Pro 93.8 · HumanEval 92.3 · SWE-Bench 68.0 · Input $15.00 · Output $60.00 · Context 200K
15
DeepSeek R2 · DeepSeek · Open
DeepSeek’s reasoning variant. Competes with o4 on math at a fraction of the cost.
ELO n/a · MMLU-Pro 87.8 · HumanEval 90.1 · SWE-Bench n/a · Input $0.55 · Output $2.19 · Context 128K
16
Command A · Cohere
Cohere’s enterprise model. Built for RAG, tool use, and agentic workflows with strong citations.
ELO n/a · MMLU-Pro 80.1 · HumanEval 82.5 · SWE-Bench n/a · Input $2.50 · Output $10.00 · Context 256K
17
Nova Pro · Amazon
Amazon’s frontier model. Available exclusively on Bedrock; strong enterprise integration story.
ELO n/a · MMLU-Pro 82.5 · HumanEval 84.0 · SWE-Bench n/a · Input $0.80 · Output $3.20 · Context 300K
18
Inflection 3 · Inflection
Pi’s personality-tuned model. Famous for conversational warmth and emotional intelligence.
ELO n/a · MMLU-Pro 78.5 · HumanEval 79.8 · SWE-Bench n/a · Input $3.00 · Output $12.00 · Context 128K
19
Sonar Large · Perplexity
Tuned from Llama, with a search-first system prompt. The canonical answer-with-citations model.
ELO n/a · MMLU-Pro 79.2 · HumanEval 78.5 · SWE-Bench n/a · Input $1.00 · Output $5.00 · Context 128K
20
Phi-5 · Microsoft · Open
Microsoft’s small-model champion. Punches well above its weight class; ideal for edge inference.
ELO n/a · MMLU-Pro 74.0 · HumanEval 80.5 · SWE-Bench n/a · Input $0.10 · Output $0.40 · Context 128K
21
Reka Flash 3 · Reka
Pure multimodal research lab. Video and audio native at a fraction of Gemini Ultra’s price.
ELO n/a · MMLU-Pro 76.8 · HumanEval 78.0 · SWE-Bench n/a · Input $0.40 · Output $1.00 · Context 128K
22
Jamba 2 · AI21 · Open
Hybrid Mamba + Transformer architecture. Linear cost scaling on long contexts, so the 256K window is actually fast.
ELO n/a · MMLU-Pro 75.5 · HumanEval 77.0 · SWE-Bench n/a · Input $0.50 · Output $1.50 · Context 256K
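
Several descriptions above compare prices between models, for example "Near-Opus quality at 1/5 the cost". A rough sketch of how such a ratio can be checked against the listed prices follows. `PRICES` and `price_ratio` are hypothetical names used only for illustration; the figures are the Input $ and Output $ values from the listing, assumed here to be USD per million tokens.

```python
from typing import Dict, Tuple

# (input, output) prices copied from the listing; units assumed to be USD per 1M tokens.
PRICES: Dict[str, Tuple[float, float]] = {
    "Claude Opus 4.7": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "DeepSeek V4": (0.14, 0.28),
}


def price_ratio(reference: str, cheaper: str) -> Tuple[float, float]:
    """Return how many times more the reference model costs, as (input_ratio, output_ratio)."""
    ref_in, ref_out = PRICES[reference]
    alt_in, alt_out = PRICES[cheaper]
    return ref_in / alt_in, ref_out / alt_out


sonnet = price_ratio("Claude Opus 4.7", "Claude Sonnet 4.6")
deepseek = price_ratio("Claude Opus 4.7", "DeepSeek V4")

print(sonnet)                        # (5.0, 5.0)
print([round(x) for x in deepseek])  # [107, 268]
```

On these figures Sonnet 4.6 comes out at one fifth of the Opus price on both input and output, while the DeepSeek V4 ratio differs noticeably between input and output, so any single "1/N the price" figure depends on which price is being compared.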