Pretraining
Common Crawl
The raw web, captured monthly. The starting point for nearly every LLM pretraining pipeline.
Size
300B+ pages
Format
WARC
License
Common Crawl Terms
Maintainer
Common Crawl Foundation
What it’s for
Common Crawl is a raw capture of the web, refreshed roughly monthly. Nearly every LLM pretraining pipeline starts from it.
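Because the data ships as WARC, the first step in a pipeline is usually to walk the records and keep the HTTP responses. The following is a minimal sketch of that step, not a reference reader. It assumes uncompressed WARC/1.0 records held in memory, whereas real crawl files are gzip-compressed and are normally read with a dedicated library. The names `iter_warc_records` and `make_record` and the sample data are hypothetical and exist only for illustration.

```python
import io


def iter_warc_records(stream):
    """Yield (headers, payload) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Content-Length gives the exact payload size in bytes.
        payload = stream.read(int(headers["content-length"]))
        yield headers, payload


def make_record(warc_type, uri, body):
    """Build one synthetic WARC record (illustration only)."""
    head = (
        "WARC/1.0\r\n"
        f"WARC-Type: {warc_type}\r\n"
        f"WARC-Target-URI: {uri}\r\n"
        f"Content-Length: {len(body)}\r\n"
        "\r\n"
    ).encode("utf-8")
    return head + body + b"\r\n\r\n"


# Two synthetic records: one response, one request.
sample = make_record(
    "response", "http://example.com/", b"HTTP/1.1 200 OK\r\n\r\n<html>hi</html>"
) + make_record("request", "http://example.com/", b"GET / HTTP/1.1\r\n\r\n")

records = list(iter_warc_records(io.BytesIO(sample)))

# Keep only response records, which carry the fetched page content.
responses = [
    (h["warc-target-uri"], p) for h, p in records if h["warc-type"] == "response"
]
```

Filtering on `WARC-Type: response` matters because a WARC file also contains request and metadata records, which hold no page content.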