Pretraining
FineWeb
FineWeb is a web-text corpus built from Common Crawl through aggressive filtering and deduplication. The FineWeb-Edu subset adds educational-content classifiers on top, which further improves downstream model quality. FineWeb set the bar for open pretraining data in 2024.
Size
15T tokens
Format
parquet
License
ODC-By 1.0
Maintainer
Hugging Face
What it’s for
Hugging Face’s 15T-token curated web dataset, among the highest-quality openly available sources of pretraining data.
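At 15T tokens, the dataset is usually too large to download in full, so a common first step is to stream a few records. The sketch below is a minimal example of that, not an official recipe. It assumes the Hugging Face `datasets` library is installed, and the repository ID `HuggingFaceFW/fineweb`, the `sample-10BT` configuration, and the `text` field are assumptions not stated on this page. The helper names `take_texts` and `stream_fineweb` are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def take_texts(records: Iterable[dict], n: int, field: str = "text") -> Iterator[str]:
    """Yield the `field` value of the first `n` records.

    Works on any iterable of dicts, including a streamed dataset.
    The default field name "text" is an assumption about the schema.
    """
    for record in islice(records, n):
        yield record[field]


def stream_fineweb(config: str = "sample-10BT"):
    """Open a FineWeb configuration in streaming mode (no full download).

    The repository ID and config name are assumptions, not taken from
    this page. The import is local so that `take_texts` can be used
    without the `datasets` library installed.
    """
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb",
        name=config,
        split="train",
        streaming=True,  # iterate over remote parquet shards lazily
    )


# Example usage (requires network access and the `datasets` package):
#
#     for text in take_texts(stream_fineweb(), 3):
#         print(text[:200])
```

Streaming keeps memory and disk use small, since records are read lazily from the parquet shards instead of being materialized locally.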