Analysis · Mar 27, 2026

Why we think Llama 4 Scout is the best sub-8B deployment today

Our analysis of price, quality, and fit for self-hosted workloads: why Scout wins for production sub-8B deployments.

We’ve been running internal benchmarks against our own Inference Index dataset, and the consistent winner for sub-8B-class workloads is Llama 4 Scout. It fits on a single H100, runs at 300+ tok/s on vLLM with FP8 quantization, and scores within 3 points of models 5x its parameter count on our instruction-following and long-context retrieval tests.

This is a tactical recommendation, not a "best overall" one: if you can use Sonnet or Gemini Flash through an API, the engineering overhead of self-hosting is rarely worth the savings. But if you’re deploying models into air-gapped, VPC-only, or data-sovereign environments, Scout is our default pick.
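To make the price side of that trade-off concrete, here is a minimal back-of-the-envelope sketch of self-hosted cost per million generated tokens. It reuses the 300 tok/s figure quoted above and treats it as sustained throughput on one fully utilized GPU, which is a simplifying assumption. The hourly GPU price is a placeholder rather than a quote from any provider, and the function name `cost_per_million_tokens` is ours, purely for illustration.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD to generate one million tokens on a single, fully utilized GPU.

    Ignores engineering time, idle capacity, and batching effects, so treat
    the result as a rough floor on hardware cost rather than a total cost.
    """
    if gpu_hourly_usd < 0 or tokens_per_second <= 0:
        raise ValueError("price must be non-negative and throughput positive")
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Placeholder hourly price (an assumption); substitute your own contract rate.
PLACEHOLDER_GPU_HOURLY_USD = 3.00

# 300 tok/s is the throughput figure from the text above.
print(round(cost_per_million_tokens(PLACEHOLDER_GPU_HOURLY_USD, 300), 2))  # prints 2.78
```

Comparing this number against an API's per-million-token price gives a rough break-even point, before the engineering overhead mentioned above is counted in.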

By Inference Index
