Pretraining
Dolma v1.7
Allen Institute’s 3T-token open-license dataset. Fully documented provenance and filtering steps.
Size
3T tokens
Format
jsonl
License
ImpACT MR
Maintainer
Allen Institute for AI
What it’s for
Allen Institute’s 3T-token open-license dataset. Fully documented provenance and filtering steps.