Common Crawl

A raw snapshot of the web, crawled roughly monthly. The starting point for most LLM pretraining pipelines.

Size

300B+ pages

Format

WARC

License

Common Crawl Terms

Maintainer

Common Crawl Foundation

What it’s for

Raw input for LLM pretraining. Most pipelines start from these roughly monthly crawls, then extract text, filter, and deduplicate before training.
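
Since the format listed above is WARC, a first step in most pipelines is iterating over WARC records. The following is a minimal standard-library sketch, not a replacement for a dedicated WARC library. The names `iter_warc_records` and `make_sample` are hypothetical, the sample record is synthetic, and the parser assumes well-formed records with a `Content-Length` header and exact-case header names.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in a decompressed WARC byte stream.

    Simplified sketch: header names are matched case-sensitively and
    multi-line (folded) header values are not handled.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError(f"expected WARC version line, got {line!r}")

        # Read "Name: value" header lines until the blank line.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # The content block is exactly Content-Length bytes long.
        body = stream.read(int(headers["Content-Length"]))
        yield headers, body


def make_sample():
    """Build a tiny synthetic .warc.gz payload: two gzip members, one record each."""
    payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>hi</html>"
    record = b"".join([
        b"WARC/1.0\r\n",
        b"WARC-Type: response\r\n",
        b"WARC-Target-URI: http://example.com/\r\n",
        b"Content-Length: " + str(len(payload)).encode() + b"\r\n",
        b"\r\n",
        payload,
        b"\r\n\r\n",
    ])
    return gzip.compress(record) + gzip.compress(record)


if __name__ == "__main__":
    # gzip.open reads concatenated gzip members as one continuous stream.
    with gzip.open(io.BytesIO(make_sample())) as f:
        for headers, body in iter_warc_records(f):
            print(headers["WARC-Type"], headers["WARC-Target-URI"], len(body))
```

For a real crawl file, the same iterator could be fed from `gzip.open(path, "rb")` on a downloaded `.warc.gz` file. For production use, a maintained WARC parsing library is likely the safer choice.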