LLMDataHub

A quick guide (especially) for trending instruction finetuning datasets

<p align="center" width="60%"> <img src="LOGO.png" width="40%" height="40%"> </p>

<div align="center">LLMDataHub: Awesome Datasets for LLM Training </div>


<p align="center"> 🔥 <a href="#general_aligment" target="_blank">Alignment Datasets</a> • 💡 <a href="#domain-specific" target="_blank">Domain-specific Datasets</a> • :atom: <a href="#pretrain" target="_blank">Pretraining Datasets</a> • 🖼️ <a href="#multimodal" target="_blank">Multimodal Datasets</a> <br> </p> <p align="center"> <img alt="GitHub last commit" src="https://img.shields.io/github/last-commit/Zjh-819/LLMDataHub"> <img alt="GitHub Repo stars" src="https://img.shields.io/github/stars/Zjh-819/LLMDataHub"> </p>

Introduction 📄

Large language models (LLMs), such as OpenAI's GPT series, Google's Bard, and Baidu's Wenxin Yiyan, are driving profound technological change. With the emergence of open-source model families like LLaMA and ChatGLM, training an LLM is no longer the exclusive domain of resource-rich companies, and training LLMs in small organizations or by individuals has become an active interest in the open-source community, with notable works including Alpaca, Vicuna, and Luotuo. Beyond model frameworks, large-scale, high-quality training corpora are also essential for training LLMs, yet the relevant open-source corpora are still scattered across the community. The goal of this repository is therefore to continuously collect high-quality, open-source training corpora for LLMs.

Training a chatbot LLM that can follow human instructions effectively requires access to high-quality datasets covering a range of conversation domains and styles. This repository provides a curated collection of datasets designed for chatbot training, including links, size, language, usage, and a brief description of each dataset. Our goal is to make it easier for researchers and practitioners to identify and select the most relevant and useful datasets for their chatbot LLM training needs. Whether you're working on improving chatbot dialogue quality, response generation, or language understanding, this repository has something for you.

Contact 📬 <br/>

If you want to contribute, you can contact:

Junhao Zhao 📧 <br/> Advised by Prof. Wanyun Cui

<div id="general_aligment">General Open Access Datasets for Alignment 🟢:</div>

Type Tags 🏷️:

  • SFT: Supervised Finetuning
    • Dialog: each entry contains a continuous multi-turn conversation
    • Pairs: each entry is an input-output pair
    • Context: each entry has a context text and related QA pairs
  • PT: Pretraining
  • CoT: Chain-of-Thought Finetuning
  • RLHF: used to train the reward model in Reinforcement Learning from Human Feedback
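To make the type tags above concrete, here is a minimal sketch of what a single entry of each format might look like. The field names are illustrative assumptions only; real datasets in the tables below use varying schemas.

```python
# Hypothetical single entries for each type tag. Field names are
# illustrative only -- actual datasets use their own schemas.

# SFT / Pairs: one input-output pair per entry.
sft_pairs = {"input": "Translate 'bonjour' to English.", "output": "Hello."}

# SFT / Dialog: one continuous multi-turn conversation per entry.
sft_dialog = {
    "conversation": [
        {"role": "user", "content": "What is an LLM?"},
        {"role": "assistant", "content": "A large language model."},
        {"role": "user", "content": "Name one."},
        {"role": "assistant", "content": "GPT-4."},
    ]
}

# SFT / Context: a context text plus related QA pairs.
sft_context = {
    "context": "The Eiffel Tower is in Paris.",
    "qa_pairs": [{"question": "Where is the Eiffel Tower?", "answer": "Paris."}],
}

# RLHF: preference data (a preferred and a dispreferred response)
# used to train a reward model.
rlhf_pair = {
    "prompt": "Explain gravity briefly.",
    "chosen": "Gravity is the attraction between masses.",
    "rejected": "I don't know.",
}

# PT: raw unlabeled text for pretraining.
pt_sample = {"text": "Raw unlabeled text for pretraining ..."}
```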

Datasets released in November 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| helpSteer | / | RLHF | English | 37k instances | An RLHF dataset annotated by humans with helpfulness, correctness, coherence, complexity, and verbosity measures. |
| no_robots | / | SFT | English | 10k instances | High-quality human-created SFT data, single turn. |
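Per-attribute annotations like helpSteer's are often collapsed into a single scalar when training a reward model. A minimal sketch of one such aggregation; the equal-weight average is an assumption for illustration, not a recipe from the dataset authors:

```python
# Collapse helpSteer-style attribute scores (each 0-4) into one scalar.
# The equal-weight average below is an illustrative choice only.

ATTRIBUTES = ("helpfulness", "correctness", "coherence",
              "complexity", "verbosity")

def scalar_reward(annotation: dict) -> float:
    """Average the five attribute scores into one reward in [0, 4]."""
    return sum(annotation[a] for a in ATTRIBUTES) / len(ATTRIBUTES)

example = {"helpfulness": 4, "correctness": 4, "coherence": 3,
           "complexity": 2, "verbosity": 1}
print(scalar_reward(example))  # -> 2.8
```

In practice one might weight the attributes differently (e.g. down-weighting verbosity), or feed the raw attribute vector to the reward model directly.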

Datasets released in September 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| Anthropic_<br/>HH_Golden | ULMA | SFT / RLHF | English | train 42.5k + test 2.3k | An improved version of the harmless subset of Anthropic's Helpful and Harmless (HH) datasets, with the original "chosen" answers rewritten by GPT-4. Compared with the original Harmless dataset, it empirically improves the performance of RLHF, DPO, and ULMA methods significantly on harmlessness metrics. |
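Chosen/rejected pairs like those in Anthropic_HH_Golden are exactly what preference-optimization methods such as DPO consume. A minimal sketch of the DPO objective on one pair, assuming summed response log-probabilities under the policy and a frozen reference model are already available (the numbers are illustrative):

```python
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    pi_* / ref_*: summed log-probs of each response under the policy
    and reference models. beta scales the implicit reward margin.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)); smaller when the policy prefers "chosen"
    # more strongly than the reference does.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy favors the chosen response relative to the reference,
# so the loss falls below log(2) (the value at zero margin).
loss = dpo_loss(-10.0, -14.0, -11.0, -13.0)
```

At a margin of zero the loss is exactly log(2), which gives a quick sanity check for an implementation.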

Datasets released in August 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| function_<br/>calling_<br/>extended | / | Pairs | English<br/>code | / | A high-quality human-created dataset for enhancing LMs' API-using ability. |
| AmericanStories | / | PT | English | / | A vast corpus scanned from the US Library of Congress. |
| dolma | OLMo | PT | / | 3T tokens | A large, diverse open-source corpus for LM pretraining. |
| Platypus | Platypus2 | Pairs | English | 25K | A very high-quality dataset for improving LMs' STEM reasoning ability. |
| Puffin | Redmond-Puffin<br/>Series | Dialog | English | ~3k entries | A dataset of conversations between real humans and GPT-4, featuring long contexts (over 1k tokens per conversation) and multi-turn dialogs. |
| tiny series | / | Pairs | English | / | A series of short, concise code and text samples aimed at improving LMs' reasoning ability. |
| LongBench | / | Evaluation<br/>Only | English<br/>Chinese | 17 tasks | A benchmark for evaluating LLMs' long-context understanding capability. |

Datasets released in July 2023

| Dataset name | Used by | Type | Language | Size | Description |
|--------------|---------|------|----------|------|-------------|
| orca-chat | / | Dialog | English | 198,463 entries | An Orca-style dialog dataset aimed at improving LMs' long-context conversational ability. |
| DialogStudio | / | Dialog | Multilingual | / | A collection of diverse datasets aimed at building conversational chatbots. |
