Inference Index

Datasets.

The fuel behind every model. Pretraining corpora, instruction sets, evaluation suites, and domain-specific data.

Pretraining

FineWeb

HuggingFace’s 15T-token curated web dataset, filtered and deduplicated from Common Crawl snapshots. Among the highest-quality openly available pretraining corpora.

Size: 15T tokens
Format: parquet
License: ODC-By 1.0
Maintainer: HuggingFace

Code

The Stack v2

The largest open-source code dataset. 67.5TB of permissively licensed source code across 600+ languages.

Size: 67.5TB
Format: parquet
License: Multiple (per-file)
Maintainer: BigCode / HuggingFace

Pretraining

Common Crawl

The raw web, captured monthly. The starting point for most LLM pretraining pipelines.

Size: 300B+ pages
Format: WARC
License: Common Crawl Terms
Maintainer: Common Crawl Foundation

Pretraining

RedPajama v2

30T-token open web dataset built from 84 Common Crawl snapshots, with 40+ quality signals built in. Successor to RedPajama v1, the open recreation of the LLaMA pretraining dataset.

Size: 30T tokens
Format: jsonl
License: Apache 2.0 (code)
Maintainer: Together AI

Pretraining

The Pile

EleutherAI’s 800GB diverse text dataset. Historically significant; the training corpus behind GPT-Neo and GPT-J.

Size: 800GB
Format: jsonl
License: MIT (code)
Maintainer: EleutherAI

Pretraining

Dolma v1.7

Allen Institute’s 3T-token open-license dataset. Fully documented provenance and filtering steps.

Size: 3T tokens
Format: jsonl
License: ODC-By 1.0
Maintainer: Allen Institute for AI

Instruction

ShareGPT

90K real user-ChatGPT conversations. The dataset behind Vicuna; the OG instruction-tuning corpus.

Size: ~90K conversations
Format: jsonl
License: Research only
Maintainer: Community

Instruction

OpenAssistant

Crowd-sourced conversational assistant data. Fully open, used to train OpenAssistant and its descendants.

Size: 161K messages (66K conversation trees)
Format: jsonl
License: Apache 2.0
Maintainer: LAION

Instruction

Dolly 15K

Databricks’ 15K hand-written instruction dataset. Commercially usable: the whole set is released under CC-BY-SA 3.0.

Size: 15K examples
Format: jsonl
License: CC-BY-SA-3.0
Maintainer: Databricks

Instruction

Alpaca

Stanford’s 52K self-instruct dataset. Research-use only, but spawned an entire generation of open models.

Size: 52K examples
Format: jsonl
License: CC-BY-NC-4.0
Maintainer: Stanford

Multimodal

LAION-5B

5.85B image-text pairs scraped from Common Crawl. The foundation of open text-to-image.

Size: 5.85B pairs
Format: parquet
License: CC-BY-4.0 (metadata)
Maintainer: LAION

Multimodal

DataComp-1B

Rigorously filtered 1.4B image-text pairs. Quality-filtered subset of the 12.8B-pair CommonPool that outperforms raw LAION at equivalent scale.

Size: 1.4B pairs
Format: parquet
License: CC-BY-4.0
Maintainer: University of Washington / LAION / HuggingFace

Evaluation

MMLU-Pro (test set)

12K harder, 10-option multiple-choice questions across 14 disciplines, extending the original MMLU. Use it to benchmark new models on expert knowledge and reasoning.

Size: 12K questions
Format: jsonl
License: MIT
Maintainer: TIGER-Lab

Evaluation

SWE-Bench

2,294 real GitHub issues from 12 popular Python repos. The gold-standard dataset for coding agents.

Size: 2.3K issues
Format: jsonl
License: MIT
Maintainer: Princeton NLP

Code

CodeSearchNet

6M functions across 6 languages, roughly 2M of them paired with docstrings. Canonical dataset for code search and embedding models.

Size: 6M functions
Format: jsonl
License: Apache 2.0
Maintainer: GitHub

Domain

arXiv Dataset

Full-text snapshot of the arXiv preprint archive. A common substrate for AI science-assistant products.

Size: 2.3M papers
Format: parquet
License: CC0 (metadata) + per-paper
Maintainer: arXiv / Cornell

Domain

PubMed

36M biomedical abstracts. The standard corpus for medical AI fine-tuning.

Size: 36M abstracts
Format: XML
License: Public Domain (US Government)
Maintainer: NIH / NLM

Instruction

UltraChat

1.4M high-quality multi-turn conversations generated by GPT-3.5/4. Open-license SFT goldmine.

Size: 1.4M conversations
Format: jsonl
License: MIT
Maintainer: Tsinghua / OpenBMB