Inference Index
Pretraining

The Pile

An 800GB diverse text dataset from EleutherAI. Historically significant as the training corpus for GPT-J and GPT-Neo.

Size: 800GB
Format: jsonl
License: MIT (code)
Maintainer: EleutherAI

What it’s for

Pretraining language models on diverse text. It is mainly of historical interest today, as the corpus that GPT-J and GPT-Neo were trained on.
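Because the listed format is jsonl (one JSON object per line), a corpus of this size is usually read as a stream rather than loaded whole. The following is a minimal sketch of that pattern, not the maintainers' tooling: the helper name `iter_jsonl`, the in-memory sample, and the `text` and `meta` field names are all assumptions made for illustration.

```python
import io
import json
from typing import Iterator, TextIO


def iter_jsonl(stream: TextIO) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of the stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines instead of failing on them
            yield json.loads(line)


# Hypothetical two-record sample standing in for a real shard.
# The "text" and "meta" field names are assumptions, not confirmed by this page.
sample = io.StringIO(
    '{"text": "First document.", "meta": {"source": "example-a"}}\n'
    '{"text": "Second document.", "meta": {"source": "example-b"}}\n'
)

records = list(iter_jsonl(sample))
total_chars = sum(len(r["text"]) for r in records)
```

For a real file, the same generator can be passed an opened file handle, so only one record is held in memory at a time.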