Datasets
Common Crawl
Petabyte-scale web crawl data used as the raw foundation for many LLM training corpora
The Pile
800GB curated mixture of high-quality text including books, code, academic papers, and web data
OpenWebText
Open-source recreation of OpenAI's WebText dataset, sourced from high-quality Reddit-linked pages
FineWeb
Massively filtered and deduplicated Common Crawl dataset optimized for training large language models; see the streaming sketch after this list
C4 (Colossal Clean Crawled Corpus)
Large-scale cleaned web text dataset used to train T5 and other foundational models
Wikipedia
High-quality encyclopedic text used for factual grounding and clean language modeling
BooksCorpus
Collection of unpublished books used to teach long-form coherence and narrative structure
The Stack
Massive multilingual source code dataset from GitHub used for training code-capable LLMs
RedPajama
Open-source replication of LLaMA-style training data (about 1 trillion tokens), combining web crawl, books, code, and Wikipedia drawn from several of the datasets above
LAION-5B
Large-scale dataset of roughly five billion image–text pairs used for training multimodal and diffusion-based models
Project Gutenberg
Public-domain books dataset widely used for long-context and literary language modeling
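Most of the corpora above are published on the Hugging Face Hub, so they can be streamed instead of downloaded in full. Below is a minimal sketch of iterating over FineWeb with the `datasets` library; the repository ID `HuggingFaceFW/fineweb` and the `sample-10BT` subset are assumptions to verify on the Hub before use.

```python
from datasets import load_dataset

# Stream the corpus so the multi-terabyte dump is never fully downloaded.
# Repository ID and subset name are assumptions; check the Hub for the
# current names and available subsets.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # small sample subset, if available
    split="train",
    streaming=True,
)

# Peek at the first few documents; each record carries the raw page text.
for i, doc in enumerate(fineweb):
    print(doc["text"][:200])
    if i == 2:
        break
```

The same pattern works for C4, The Pile mirrors, or RedPajama by swapping in the corresponding repository ID.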
LLM Evaluations
MMLU Dataset
Massive Multitask Language Understanding, a multiple-choice benchmark evaluating knowledge across 57 diverse subjects
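MMLU is scored as multiple choice: each question has four options and a gold answer index, and accuracy is the fraction of questions where the model picks the correct letter. A minimal sketch, assuming the `cais/mmlu` dataset on the Hugging Face Hub and a placeholder `predict_letter` function standing in for the actual model call:

```python
from datasets import load_dataset

# One of the 57 subjects; "all" combines them. Repository ID is an assumption.
dataset = load_dataset("cais/mmlu", "abstract_algebra", split="test")

def format_prompt(example) -> str:
    # Each record has a question, four choices, and the gold answer index.
    options = "\n".join(
        f"{letter}. {choice}"
        for letter, choice in zip("ABCD", example["choices"])
    )
    return f"{example['question']}\n{options}\nAnswer:"

def predict_letter(prompt: str) -> str:
    # Placeholder baseline that always answers "A"; swap in a real model call.
    return "A"

correct = sum(
    predict_letter(format_prompt(ex)) == "ABCD"[ex["answer"]]
    for ex in dataset
)
print(f"accuracy: {correct / len(dataset):.3f}")
```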
Agent Tools
OpenCode
Open-source framework for building AI agents that can write and understand code
Docker
Platform for developing, shipping, and running applications in containers; useful for safely sandboxing agentic AI workloads
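One common pattern is to have the agent execute model-generated code inside a throwaway container with no network and tight resource limits. A minimal sketch using the Docker CLI through `subprocess`; the base image and resource limits are illustrative assumptions, not a hardened configuration:

```python
import subprocess

def run_untrusted(code: str, timeout: int = 30) -> str:
    """Run model-generated Python in a locked-down, throwaway container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",      # remove the container when done
            "--network", "none",          # no network access
            "--memory", "256m",           # cap RAM
            "--cpus", "0.5",              # cap CPU
            "--pids-limit", "64",         # limit process count
            "python:3.12-slim",           # illustrative base image
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr

print(run_untrusted("print(2 + 2)"))
```

Cutting off the network and capping memory, CPU, and process count limits the blast radius if the generated code misbehaves.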