Datasets
Common Crawl
Petabyte-scale web crawl data used as the raw foundation for many LLM training corpora
The Pile
800GB curated mixture of high-quality text including books, code, academic papers, and web data
OpenWebText
Open-source recreation of OpenAI's WebText dataset, sourced from high-quality Reddit-linked pages
FineWeb
Massively filtered and deduplicated Common Crawl dataset optimized for training large language models; see the streaming sketch after this list
C4 (Colossal Clean Crawled Corpus)
Large-scale cleaned web text dataset used to train T5 and other foundational models
Wikipedia
High-quality encyclopedic text used for factual grounding and clean language modeling
BooksCorpus
Collection of unpublished books used to teach long-form coherence and narrative structure
The Stack
Massive multilingual source code dataset from GitHub used for training code-capable LLMs
RedPajama
Open-source replication of LLaMA-style training data (about 1 trillion tokens), combining web crawl, books, code, and Wikipedia drawn from several of the datasets above
LAION-5B
Large-scale dataset of roughly five billion image–text pairs used for training multimodal and diffusion-based models
Project Gutenberg
Public-domain books dataset widely used for long-context and literary language modeling
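Most of the corpora above are published on the Hugging Face Hub, so they can be streamed instead of downloaded in full. Below is a minimal sketch of iterating over FineWeb with the `datasets` library; the repository ID `HuggingFaceFW/fineweb` and the `sample-10BT` subset are assumptions to verify on the Hub before use.

```python
from datasets import load_dataset

# Stream the corpus so the multi-terabyte dump is never fully downloaded.
# Repository ID and subset name are assumptions; check the Hub for the
# current names and available subsets.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # small sample subset, if available
    split="train",
    streaming=True,
)

# Peek at the first few documents; each record carries the raw page text.
for i, doc in enumerate(fineweb):
    print(doc["text"][:200])
    if i == 2:
        break
```

The same pattern works for C4, The Pile mirrors, or RedPajama by swapping in the corresponding repository ID.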
LLM Evaluations
MMLU Dataset
Massive Multitask Language Understanding, a multiple-choice benchmark evaluating knowledge across 57 diverse subjects
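MMLU is scored as multiple choice: each question has four options and a gold answer index, and accuracy is the fraction of questions where the model picks the correct letter. A minimal sketch, assuming the `cais/mmlu` dataset on the Hugging Face Hub and a placeholder `predict_letter` function standing in for the actual model call:

```python
from datasets import load_dataset

# One of the 57 subjects; "all" combines them. Repository ID is an assumption.
dataset = load_dataset("cais/mmlu", "abstract_algebra", split="test")

def format_prompt(example) -> str:
    # Each record has a question, four choices, and the gold answer index.
    options = "\n".join(
        f"{letter}. {choice}"
        for letter, choice in zip("ABCD", example["choices"])
    )
    return f"{example['question']}\n{options}\nAnswer:"

def predict_letter(prompt: str) -> str:
    # Placeholder baseline that always answers "A"; swap in a real model call.
    return "A"

correct = sum(
    predict_letter(format_prompt(ex)) == "ABCD"[ex["answer"]]
    for ex in dataset
)
print(f"accuracy: {correct / len(dataset):.3f}")
```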
Agent Tools
OpenCode
Open-source framework for building AI agents that can write and understand code
Docker
Platform for developing, shipping, and running applications in containers; useful for safely sandboxing agentic AI workloads
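One common pattern is to have the agent execute model-generated code inside a throwaway container with no network and tight resource limits. A minimal sketch using the Docker CLI through `subprocess`; the base image and resource limits are illustrative assumptions, not a hardened configuration:

```python
import subprocess

def run_untrusted(code: str, timeout: int = 30) -> str:
    """Run model-generated Python in a locked-down, throwaway container."""
    result = subprocess.run(
        [
            "docker", "run", "--rm",      # remove the container when done
            "--network", "none",          # no network access
            "--memory", "256m",           # cap RAM
            "--cpus", "0.5",              # cap CPU
            "--pids-limit", "64",         # limit process count
            "python:3.12-slim",           # illustrative base image
            "python", "-c", code,
        ],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else result.stderr

print(run_untrusted("print(2 + 2)"))
```

Cutting off the network and capping memory, CPU, and process count limits the blast radius if the generated code misbehaves.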