Inference Index
Pretraining

Dolma v1.7

Allen Institute’s 3T-token open-license dataset. Fully documented provenance and filtering steps.

Size: 3T tokens
Format: jsonl
License: ImpACT MR
Maintainer: Allen Institute for AI

What it’s for

Allen Institute’s 3T-token open-license dataset. Fully documented provenance and filtering steps.

Known training usage