Pretraining
The Pile
EleutherAI’s 800GB diverse text dataset. Historically significant: the training corpus for GPT-Neo and GPT-J.
Size
800GB
Format
jsonl
License
MIT (code)
Maintainer
EleutherAI
What it’s for
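Pretraining language models on a broad mix of English text. The Pile is EleutherAI’s 800GB corpus, assembled from 22 smaller datasets, and is historically significant as the training data for GPT-Neo and GPT-J.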