Inference Index
Pretraining

RedPajama v2

30T-token open recreation of the LLaMA pretraining dataset, with quality signals built in.

Size
30T tokens
Format
jsonl
License
Apache 2.0 (code)
Maintainer
Together AI

What it’s for

Pretraining large language models on open data. The built-in quality signals let you filter documents yourself before training.
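
As a rough illustration of how jsonl records with quality signals might be filtered, here is a minimal sketch. The field name `quality_signals`, the signal name `score`, and the threshold are all hypothetical and chosen for the example; check the dataset's own documentation for the real schema before relying on any of them.

```python
import io
import json
from typing import Iterable, Iterator


def filter_by_signal(lines: Iterable[str], signal: str, minimum: float) -> Iterator[dict]:
    """Yield JSONL records whose `signal` value is at least `minimum`."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        # "quality_signals" is a hypothetical field name, not the confirmed schema.
        value = record.get("quality_signals", {}).get(signal)
        if value is not None and value >= minimum:
            yield record


# Two made-up records standing in for lines of a jsonl shard.
sample = io.StringIO(
    '{"text": "A well-formed paragraph.", "quality_signals": {"score": 0.9}}\n'
    '{"text": "buy now!!!", "quality_signals": {"score": 0.1}}\n'
)
kept = list(filter_by_signal(sample, "score", 0.5))
print(len(kept))
```

Reading line by line keeps memory use flat, which matters for shards too large to load at once; in practice `lines` would be an open file handle rather than an in-memory string.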