📊 Data & Resources

Datasets, tools, and resources for AI research

🗂️ Datasets

Common Crawl

Petabyte-scale web crawl data used as the raw foundation for many LLM training corpora

The Pile

800GB curated mixture of high-quality text including books, code, academic papers, and web data

High-Quality Data · View Dataset →

OpenWebText

Open-source recreation of OpenAI’s WebText dataset, sourced from high-quality Reddit-linked pages

High-Quality Web Text · View Dataset →

FineWeb

Massively filtered and deduplicated Common Crawl dataset optimized for training large language models

Filtered Crawl Data · View Dataset →
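
FineWeb's published pipeline relies on fuzzy (MinHash-based) deduplication across Common Crawl snapshots. As a much simpler illustration of the idea, the sketch below does exact deduplication after whitespace/case normalization; the function name and normalization choices are illustrative, not FineWeb's actual code.

```python
import hashlib

def dedup_exact(docs):
    """Keep only the first copy of each normalized document.

    Simplified illustration: real web-scale pipelines (e.g. FineWeb's)
    use fuzzy MinHash deduplication; this only catches exact duplicates
    up to whitespace and letter case.
    """
    seen = set()
    kept = []
    for doc in docs:
        # Normalize so trivial whitespace/case differences still match
        key = hashlib.sha256(" ".join(doc.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept
```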

C4 (Colossal Clean Crawled Corpus)

Large-scale cleaned web text dataset used to train T5 and other foundational models

Cleaned Web Text · View Dataset →
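
C4's cleaning is a set of simple heuristics described in the T5 paper: keep only lines ending in terminal punctuation, drop very short lines, and discard pages with too few sentences or boilerplate markers like "lorem ipsum" or curly braces. A hedged sketch of a few of those rules (thresholds are approximations, not the exact published pipeline):

```python
import re

TERMINAL = (".", "!", "?", '"')

def c4_style_clean(page_text):
    """Sketch of a few C4-style cleaning heuristics (after the T5 paper):
    keep lines ending in terminal punctuation with at least 5 words,
    then discard pages with fewer than 3 sentences or containing
    boilerplate markers ('lorem ipsum', curly braces)."""
    lines = [l.strip() for l in page_text.splitlines()]
    kept = [l for l in lines if l.endswith(TERMINAL) and len(l.split()) >= 5]
    text = "\n".join(kept)
    if "lorem ipsum" in text.lower() or "{" in text:
        return None  # discard the whole page
    if len(re.findall(r"[.!?]", text)) < 3:
        return None  # too few sentences to be useful prose
    return text
```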

Wikipedia

High-quality encyclopedic text used for factual grounding and clean language modeling

Encyclopedic Text · View Dataset →

BooksCorpus

Collection of unpublished books used to teach long-form coherence and narrative structure

The Stack

Massive multilingual source code dataset from GitHub used for training code-capable LLMs

RedPajama

Open-source replication of LLaMA-style training data (roughly 1T tokens) including web, books, code, and Wikipedia

LLaMA-style Data · View Dataset →

LAION-5B

Large-scale image–text dataset used for training multimodal and diffusion-based models

Multimodal Data · View Dataset →

Project Gutenberg

Public-domain books dataset widely used for long-context and literary language modeling

Public Domain Books · View Dataset →

πŸ› οΈ LLM Evaluations

MMLU

Massive Multitask Language Understanding: a benchmark of 57 subjects evaluating knowledge across diverse domains

Explore →

HumanEval

Benchmarking code generation capabilities of language models

Explore →
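
HumanEval results are reported as pass@k. The Codex/HumanEval paper gives an unbiased estimator for it when you sample n completions per problem and c of them pass the unit tests; the sketch below implements that formula directly.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator from the Codex/HumanEval paper:
    given n sampled completions of which c pass the unit tests,
    estimate the probability that at least one of k samples passes.

        pass@k = 1 - C(n - c, k) / C(n, k)
    """
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)
```

In practice this is averaged over all problems in the benchmark; sampling more completions (larger n) lowers the variance of the estimate.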

TruthfulQA

Evaluating truthfulness and hallucination tendencies in LLMs

Explore →

ImageNet

Large-scale visual database for training and evaluating vision models

Explore →

πŸ› οΈ Agent Tools

LangChain

Framework for building applications with LLMs through composability

Explore →
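
The composability idea behind LangChain is that prompts, models, and output parsers are all steps that can be piped into one another. The toy sketch below illustrates only that pattern; `Step`, `fake_llm`, and the `|` overload here are illustrative stand-ins, not LangChain's actual API.

```python
class Step:
    """Minimal illustration of chain composability (not LangChain's API):
    each step wraps a callable, and `|` pipes one step's output into the next."""
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        return Step(lambda x: other.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)

# A toy "prompt -> model -> parser" chain with stand-in functions
prompt = Step(lambda topic: f"Write one fact about {topic}.")
fake_llm = Step(lambda p: p.upper())   # stands in for a real model call
parser = Step(lambda out: out.rstrip("."))

chain = prompt | fake_llm | parser
```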

Hugging Face Transformers

State-of-the-art machine learning for PyTorch and TensorFlow

Explore →

OpenCode

Open-source framework for building AI agents that can write and understand code

Explore →

llama.cpp

Efficient C/C++ implementation of LLaMA-family models for local inference

Explore →
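
A large part of why llama.cpp runs models on consumer hardware is weight quantization (its GGUF formats store low-bit block-quantized weights). As a hedged, much-simplified sketch of the underlying idea, here is plain symmetric int8 quantization; llama.cpp's real schemes are block-wise and more elaborate.

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: store one float scale plus small ints.
    Simplified sketch; llama.cpp's GGUF formats use block-wise variants."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard all-zero input
    q = [round(w / scale) for w in weights]
    return scale, q

def dequantize_int8(scale, q):
    """Recover approximate float weights from the scale and int values."""
    return [scale * v for v in q]
```

The trade-off is roughly 4x less memory than float32 at the cost of a small, bounded rounding error per weight (at most half the scale in this symmetric scheme).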

Docker

Platform for developing, shipping, and running applications in containers; very useful for safely sandboxing agentic AI systems

Explore →