Inference Index
Pretraining

FineWeb

FineWeb is a web-text dataset derived from Common Crawl through aggressive deduplication and filtering. The FineWeb-Edu subset additionally applies an educational-content classifier, which further improves downstream model quality. It set the bar for open pretraining data in 2024.
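To make the two ideas above concrete, here is a minimal sketch of deduplication followed by classifier-style filtering. It is an illustration under stated assumptions, not FineWeb's actual pipeline: it uses exact hashing of normalized text as a simple stand-in for whatever deduplication FineWeb applies, and `toy_edu_score`, its keyword set, and the threshold are hypothetical placeholders for a learned educational-content classifier.

```python
import hashlib
from typing import Callable, Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def dedup_and_filter(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield documents that are new after normalization and score >= threshold."""
    seen: set[str] = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        if score(doc) >= threshold:
            yield doc


def toy_edu_score(text: str) -> float:
    """Hypothetical stand-in for a classifier: fraction of words in a keyword set."""
    keywords = {"theorem", "proof", "lesson", "equation", "experiment"}
    words = text.lower().split()
    return sum(w.strip(".,") in keywords for w in words) / max(len(words), 1)


docs = [
    "The proof of the theorem uses one equation.",
    "the  proof of the theorem uses one equation.",  # duplicate once normalized
    "Buy cheap watches now!",  # scores 0.0, so it is filtered out
]
kept = list(dedup_and_filter(docs, toy_edu_score, threshold=0.2))
```

Here `kept` holds only the first document: the second is dropped as a duplicate and the third falls below the score threshold. A web-scale pipeline would more plausibly use near-duplicate detection and a trained model in place of these toy components.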

Size: 15T tokens
Format: parquet
License: ODC-By 1.0
Maintainer: HuggingFace

What it’s for

HuggingFace’s curated 15T-token web dataset, intended for language-model pretraining. It is among the highest-quality pretraining corpora openly available.

Known training usage