RedPajama v2 (Pretraining)
30T-token open web dataset for LLM pretraining, with quality signals built in. It succeeds RedPajama v1, which was the open recreation of the LLaMA pretraining dataset.
Size
30T tokens
Format
jsonl
License
Apache 2.0 (code)
Maintainer
Together AI
What it’s for
Pretraining large language models on open data. The built-in quality signals let you filter or reweight the corpus yourself instead of relying on a fixed filtering recipe.