Inference Index
Pretraining

Common Crawl

The raw web, captured monthly. The starting point for basically every LLM pretraining pipeline.

Size: 300B+ pages
Format: WARC
License: Common Crawl Terms
Maintainer: Common Crawl Foundation

What it’s for

Common Crawl is the raw input that most LLM pretraining pipelines start from: monthly snapshots of the web, stored as WARC files.