Why we think Llama 4 Scout is the best sub-8B deployment today
Our analysis of price, quality, and fit for self-hosted workloads: why Scout wins for production sub-8B deployments.
We’ve been running internal benchmarks against our own Inference Index dataset, and the consistent winner for sub-8B-class workloads is Llama 4 Scout. It fits on a single H100, runs at 300+ tok/s on vLLM with FP8 quantization, and scores within 3 points of models 5x its parameter count on our instruction-following and long-context retrieval tests.
This is a tactical recommendation, not a "best overall" one: if you can afford Sonnet or Gemini Flash via an API, the engineering overhead of self-hosting is rarely worth the savings. But if you’re deploying models into air-gapped, VPC-only, or data-sovereign environments, Scout is our default.
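For a rough sense of the savings side of that trade-off, here is a minimal Python sketch that turns a throughput figure into a cost per million generated tokens. The 300 tok/s is the figure quoted above; the hourly GPU price is a placeholder assumption, not a quote, and the function name is ours. It also assumes the GPU stays fully utilized, which real traffic rarely manages, so treat the result as a lower bound on cost.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD to generate one million tokens, assuming the GPU is busy the whole hour."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# ASSUMPTION: placeholder hourly price for a single H100; substitute your own rate.
ASSUMED_H100_HOURLY_USD = 3.00

# 300 tok/s is the throughput quoted in the text above.
print(round(cost_per_million_tokens(ASSUMED_H100_HOURLY_USD, 300), 2))  # 2.78
```

Compare the result with the per-token price of whichever API you would otherwise use, then add the engineering time that self-hosting requires. That second term is the one that usually decides the question.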