Datasets.
The fuel behind every model. Pretraining corpora, instruction sets, evaluation suites, and domain-specific data.
FineWeb
Hugging Face’s 15T-token web dataset, filtered and deduplicated from Common Crawl. Among the highest-quality openly available pretraining corpora.
The Stack v2
BigCode’s open code dataset, one of the largest available. 67.5TB of source code across 600+ languages, sourced from the Software Heritage archive.
Common Crawl
The raw web, captured monthly. The starting point for basically every LLM pretraining pipeline.
RedPajama v2
30T-token open web dataset built from Common Crawl, with quality signals built in. The successor to RedPajama v1, which was the open recreation of the LLaMA pretraining dataset.
The Pile
EleutherAI’s 800GB diverse text dataset. Historically significant; the substrate of GPT-J/Neo.
Dolma v1.7
Allen Institute for AI’s 3T-token open-license dataset. Fully documented provenance and filtering steps.
ShareGPT
90K real user-ChatGPT conversations. The dataset behind Vicuna; the OG instruction-tuning corpus.
OpenAssistant
Crowd-sourced conversational assistant data. Fully open, used to train OpenAssistant and its descendants.
Dolly 15K
Databricks’ 15K instruction dataset, hand-written by its employees. Commercially usable: released under CC BY-SA 3.0.
Alpaca
Stanford’s 52K self-instruct dataset. Research-use only, but spawned an entire generation of open models.
LAION-5B
5.85B image-text pairs scraped from Common Crawl. The foundation of open text-to-image.
DataComp-1B
1.4B image-text pairs filtered from the 12.8B-pair CommonPool. CLIP models trained on it outperform those trained on LAION-2B at comparable compute.
MMLU-Pro (test set)
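The evaluation data behind MMLU-Pro: about 12K multiple-choice questions across 14 disciplines, each with up to ten answer options. Use it to benchmark new models on expert knowledge and reasoning.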
SWE-Bench
2,294 real GitHub issues from 12 popular Python repos. The gold-standard dataset for coding agents.
CodeSearchNet
6M functions, about 2M of them paired with docstrings, across 6 languages (Go, Java, JavaScript, PHP, Python, Ruby). Canonical dataset for code search and embedding models.
arXiv Dataset
Metadata and full-text snapshot of the arXiv preprint archive. A common substrate for AI science assistants.
PubMed
36M+ biomedical citations, most with abstracts. The standard corpus for medical AI fine-tuning.
UltraChat
1.4M multi-turn conversations generated with ChatGPT (GPT-3.5 Turbo). Open-license SFT goldmine.