
Datasets.

The fuel behind every model. Pretraining corpora, instruction sets, evaluation suites, and domain-specific data.

18 tracked

Pretraining

FineWeb

HuggingFace’s 15T-token curated web dataset, filtered and deduplicated from Common Crawl. Among the highest-quality openly available pretraining corpora.

The Stack v2

The largest open-source code dataset. 67.5TB of permissively licensed source code across 600+ languages.

BigCode / HuggingFace

Pretraining

Common Crawl

The raw web, captured monthly. The starting point for nearly every LLM pretraining pipeline.

Common Crawl Foundation

Pretraining

RedPajama v2

30T-token open web dataset built from Common Crawl, with quality signals built in. Successor to RedPajama v1, the open recreation of the LLaMA pretraining dataset.

The Pile

EleutherAI’s 800GB diverse text dataset. Historically significant as the training corpus for GPT-J and GPT-Neo.

Dolma v1.7

Allen Institute’s 3T-token open-license dataset. Fully documented provenance and filtering steps.

Allen Institute for AI

Instruction

ShareGPT

90K real conversations between users and ChatGPT. The dataset behind Vicuna and one of the earliest open chat-tuning corpora.

OpenAssistant

Crowd-sourced conversational assistant data. Fully open, used to train OpenAssistant and its descendants.

Dolly 15K

Databricks’ 15K hand-written instruction dataset. Commercially usable: every row is licensed CC BY-SA 3.0.

Alpaca

Stanford’s 52K self-instruct dataset. Research-use only, but spawned an entire generation of open models.

LAION-5B

5.85B image-text pairs scraped from Common Crawl. The foundation of open text-to-image.

DataComp-1B

1.4B image-text pairs filtered from the 12.8B-pair CommonPool. A quality-filtered set reported to outperform raw LAION data at comparable scale.

University of Washington / LAION / HuggingFace

Evaluation

MMLU-Pro (test set)

The evaluation data behind MMLU-Pro. Use it to benchmark new models on expert knowledge.

SWE-Bench

2,294 real GitHub issues from 12 popular Python repos. The gold-standard dataset for coding agents.

CodeSearchNet

6M functions across 6 languages, about 2M of them paired with docstrings. Canonical dataset for code search and embedding models.

arXiv Dataset

Full-text snapshot of the arXiv preprint archive. A common foundation for AI science-assistant products.

CC0 (metadata) + per-paper licenses

Maintainer

arXiv / Cornell

Domain

PubMed

36M biomedical citations and abstracts. The standard corpus for medical AI fine-tuning.

Public Domain (US Government)

Maintainer

NIH / NLM

Instruction

UltraChat

1.4M high-quality multi-turn conversations generated by GPT-3.5/4. Open-license SFT goldmine.
