What datasets are used to train LLMs?

Large language models (LLMs) are trained on diverse datasets compiled from publicly available text sources across the internet, books, code repositories, and specialized domains. These datasets are designed to expose the model to a wide range of language patterns, topics, and writing styles. The goal is to capture general knowledge, grammar, and reasoning abilities while balancing volume, quality, and ethical considerations. Most datasets are preprocessed to remove noise, duplicates, or harmful content, though approaches vary by organization.
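The preprocessing step described above can be sketched as a simple cleaning pass. This is a minimal illustration, not any organization's actual pipeline: the length threshold, banned-term list, and use of exact hashing for deduplication are all assumptions (production systems typically use fuzzy deduplication such as MinHash and learned quality classifiers).

```python
import hashlib

def clean_corpus(documents, min_length=200, banned_terms=("<html", "lorem ipsum")):
    """Toy preprocessing pass: drop short or noisy documents and exact duplicates.

    Thresholds and filters are illustrative assumptions, not a real pipeline.
    """
    seen_hashes = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        if len(text) < min_length:
            # Very short fragments carry little training signal.
            continue
        lowered = text.lower()
        if any(term in lowered for term in banned_terms):
            # Crude noise filter for markup remnants and filler text.
            continue
        digest = hashlib.sha256(lowered.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            # Exact-duplicate removal via content hashing.
            continue
        seen_hashes.add(digest)
        cleaned.append(text)
    return cleaned
```

Real pipelines layer many more stages on top of this (language identification, toxicity scoring, near-duplicate detection), but the shape is the same: a sequence of filters that trade raw volume for quality.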

Common sources include web content like Common Crawl, a massive snapshot of the open web containing trillions of words from blogs, forums, and news sites. For example, GPT-3 used filtered versions of Common Crawl alongside curated text from books (e.g., BookCorpus) and Wikipedia articles to improve factual accuracy. Academic papers from arXiv or PubMed are also used to train models on technical vocabulary, while platforms like Reddit provide conversational data. Code-centric models like Codex or StarCoder rely heavily on public code repositories such as GitHub, often filtered for permissively licensed projects. These datasets teach syntax, logic, and problem-solving patterns unique to programming languages.
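The license filtering mentioned for code corpora can be sketched as an allowlist check. The license identifiers and repository-record format below are assumptions for illustration; actual curation efforts (e.g., for StarCoder's training data) use repository metadata plus per-file license detection.

```python
# Hypothetical allowlist of permissive license identifiers (SPDX-style,
# lowercased); the exact set a given project accepts is an assumption here.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc"}

def select_permissive(repos):
    """Keep only repositories whose declared license is on the allowlist.

    Each repo is assumed to be a dict with a "license" key, e.g. from an
    API listing; records with no or unrecognized licenses are dropped.
    """
    return [
        repo for repo in repos
        if repo.get("license", "").lower() in PERMISSIVE_LICENSES
    ]
```

Filtering by declared license is only a first pass: copyleft code vendored inside a permissive repo, or files with conflicting headers, still require per-file checks.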

Specialized datasets address gaps in general web data. For instance, multilingual models use OSCAR (a corpus of 166 languages) or mC4 to improve non-English performance. Models focused on dialogue might incorporate customer service logs or scripted movie conversations. Ethical and legal concerns shape dataset selection—for example, excluding personally identifiable information (PII) or copyrighted text. Organizations like EleutherAI curate transparent datasets (e.g., The Pile), combining niche sources like academic journals, emails, and government documents. Ultimately, the choice of datasets depends on the model’s intended use, balancing breadth, domain specificity, and compliance with data usage policies.
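The PII exclusion mentioned above is often implemented as pattern-based redaction before training. The patterns below are a deliberately small sketch, an assumption for illustration only: real pipelines combine many more patterns with named-entity recognition and manual review.

```python
import re

# Illustrative patterns only; real PII detection covers far more cases
# (names, addresses, IDs) and typically uses ML-based recognizers too.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact_pii(text):
    """Replace email addresses and US-style phone numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text
```

Redaction like this preserves the surrounding language (useful for training) while removing the identifying tokens themselves.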
