LLMDataHub

A quick guide (especially) for trending instruction finetuning datasets

<p align="center" width="60%"> <img src="LOGO.png" width="40%" height="40%"> </p>

<div align="center">LLMDataHub: Awesome Datasets for LLM Training </div>


<p align="center"> 🔥 <a href="#general_aligment" target="_blank">Alignment Datasets</a> • 💡 <a href="#domain-specific" target="_blank">Domain-specific Datasets</a> • :atom: <a href="#pretrain" target="_blank">Pretraining Datasets</a> • 🖼️ <a href="#multimodal" target="_blank">Multimodal Datasets</a> <br> </p> <p align="center"> <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/Zjh-819/LLMDataHub"> <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/Zjh-819/LLMDataHub"> </p>

Introduction 📄

Large language models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological change. With the emergence of open-source model families like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies, and training LLMs in small organizations or by individuals has become an active interest in the open-source community, with notable works including Alpaca, Vicuna, and Luotuo. Beyond model frameworks, large-scale, high-quality training corpora are also essential for training LLMs, yet the relevant open-source corpora are still scattered across the community. The goal of this repository is therefore to continuously collect high-quality, open-source training corpora for LLMs.

Training a chatbot LLM that can follow human instructions effectively requires access to high-quality datasets covering a range of conversation domains and styles. This repository provides a curated collection of datasets designed for chatbot training, including links, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.

Contact 📬 <br/>

If you want to contribute, you can contact:

Junhao Zhao 📧 <br/> Advised by Prof. Wanyun Cui

<div id="general_aligment">General Open Access Datasets for Alignment 🟢:</div>

Type Tags 🏷️:

  • SFT: Supervised Finetuning
    • Dialog: each entry contains a continuous multi-turn conversation
    • Pairs: each entry is an input-output pair
    • Context: each entry has a context text and related QA pairs
  • PT: Pretraining
  • CoT: Chain-of-Thought Finetuning
  • RLHF: used to train the reward model in Reinforcement Learning from Human Feedback
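To make the type tags above concrete, here is a minimal sketch of what a single entry of each format might look like. The field names are illustrative assumptions only; real datasets in the tables below use varying schemas.

```python
# Hypothetical single entries for each type tag. Field names are
# illustrative only -- actual datasets use their own schemas.

# SFT / Pairs: one input-output pair per entry.
sft_pairs = {"input": "Translate 'bonjour' to English.", "output": "Hello."}

# SFT / Dialog: one continuous multi-turn conversation per entry.
sft_dialog = {
    "conversation": [
        {"role": "user", "content": "What is an LLM?"},
        {"role": "assistant", "content": "A large language model."},
        {"role": "user", "content": "Name one."},
        {"role": "assistant", "content": "GPT-4."},
    ]
}

# SFT / Context: a context text plus related QA pairs.
sft_context = {
    "context": "The Eiffel Tower is in Paris.",
    "qa_pairs": [{"question": "Where is the Eiffel Tower?", "answer": "Paris."}],
}

# RLHF: preference data (a preferred and a dispreferred response)
# used to train a reward model.
rlhf_pair = {
    "prompt": "Explain gravity briefly.",
    "chosen": "Gravity is the attraction between masses.",
    "rejected": "I don't know.",
}

# PT: raw unlabeled text for pretraining.
pt_sample = {"text": "Raw unlabeled text for pretraining ..."}
```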

Datasets released in November 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| helpSteer | / | RLHF | English | 37k instances | An RLHF dataset annotated by humans with helpfulness, correctness, coherence, complexity, and verbosity measures. |
| no_robots | / | SFT | English | 10k instances | High-quality human-created SFT data, single turn. |
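Per-attribute annotations like helpSteer's are often collapsed into a single scalar when training a reward model. A minimal sketch of one such aggregation; the equal-weight average is an assumption for illustration, not a recipe from the dataset authors:

```python
# Collapse helpSteer-style attribute scores (each 0-4) into one scalar.
# The equal-weight average below is an illustrative choice only.

ATTRIBUTES = ("helpfulness", "correctness", "coherence",
              "complexity", "verbosity")

def scalar_reward(annotation: dict) -> float:
    """Average the five attribute scores into one reward in [0, 4]."""
    return sum(annotation[a] for a in ATTRIBUTES) / len(ATTRIBUTES)

example = {"helpfulness": 4, "correctness": 4, "coherence": 3,
           "complexity": 2, "verbosity": 1}
print(scalar_reward(example))  # -> 2.8
```

In practice one might weight the attributes differently (e.g. down-weighting verbosity), or feed the raw attribute vector to the reward model directly.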

Datasets released in September 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| Anthropic_<br/>HH_Golden | ULMA | SFT / RLHF | English | train 42.5k + test 2.3k | An improved version of the harmless subset of Anthropic's Helpful and Harmless (HH) datasets, with the original "chosen" answers rewritten by GPT-4. Compared with the original Harmless dataset, it empirically improves the performance of RLHF, DPO, and ULMA methods significantly on harmlessness metrics. |
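Chosen/rejected pairs like those in Anthropic_HH_Golden are exactly what preference-optimization methods such as DPO consume. A minimal sketch of the DPO objective on one pair, assuming summed response log-probabilities under the policy and a frozen reference model are already available (the numbers are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    pi_* / ref_*: summed log-probs of each response under the policy
    and reference models. beta scales the implicit reward margin.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)); smaller when the policy prefers "chosen"
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response relative to the reference,
# so the loss falls below log(2) (the value at zero margin).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0)
```

At a margin of zero the loss is exactly log(2), which gives a quick sanity check for an implementation.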

Datasets released in August 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| function_<br/>calling_<br/>extended | / | Pairs | English<br/>code | / | A high-quality human-created dataset for enhancing LMs' API-using ability. |
| AmericanStories | / | PT | English | / | A vast corpus scanned from the US Library of Congress. |
| dolma | OLMo | PT | / | 3T tokens | A large, diverse open-source corpus for LM pretraining. |
| Platypus | Platypus2 | Pairs | English | 25K | A very high-quality dataset for improving LMs' STEM reasoning ability. |
| Puffin | Redmond-Puffin<br/>Series | Dialog | English | ~3k entries | A dataset of conversations between real humans and GPT-4, featuring long contexts (over 1k tokens per conversation) and multi-turn dialogs. |
| tiny series | / | Pairs | English | / | A series of short, concise code and text samples aimed at improving LMs' reasoning ability. |
| LongBench | / | Evaluation<br/>Only | English<br/>Chinese | 17 tasks | A benchmark for evaluating LLMs' long-context understanding capability. |

Datasets released in July 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| orca-chat | / | Dialog | English | 198,463 entries | An Orca-style dialog dataset aimed at improving LMs' long-context conversational ability. |
| DialogStudio | / | Dialog | Multilingual | / | A collection of diverse datasets aimed at building conversational chatbots. |
