
BeaverTails

BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).

Install / Use

/learn @PKU-Alignment/Beavertails

README

<!-- markdownlint-disable first-line-h1 --> <!-- markdownlint-disable html --> <div align="center"> <img src="images/PKU-BeaverTails.png" width="80%"/> </div> <h1 align="center">A Human-Preference Dataset for Improving Safety Alignment of Large Language Models</h1>

Code License Data License

[📕 Paper] [🤗 SafeRLHF Datasets] [🤗 BeaverTails] [🤗 Beaver Evaluation] [🤗 BeaverDam-7B] [BibTeX]

BeaverTails is an extensive collection of datasets specifically developed to support research on safety alignment in large language models (LLMs). The collection currently consists of three datasets:

  • A comprehensive classification dataset (PKU-Alignment/BeaverTails) with over 300k examples.
  • A preference dataset (PKU-Alignment/PKU-SafeRLHF) containing more than 300k instances.
  • A meticulously crafted evaluation dataset (PKU-Alignment/BeaverTails-Evaluation) of 700 prompts for assessing performance, comprising a mix of GPT-3.5-generated and human-written prompts. Our ongoing research will focus on expanding the dataset to further increase its size and usefulness.


🦫 What's New?

  • 2023/07/10: We announce the open-sourcing of the trained weights for our QA-Moderation model on Hugging Face: PKU-Alignment/beaver-dam-7b. This model was developed using our Classification Dataset, and the accompanying training code has also been made openly available to the community.
  • 2023/06/29: We have open-sourced a larger-scale version of the BeaverTails dataset. It now contains over 300k instances, including 301k training samples and 33.4k testing samples; for more details, see our Hugging Face dataset PKU-Alignment/BeaverTails.

Dataset Release

Classification Dataset

This dataset consists of 300k+ human-labeled question-answering (QA) pairs, each associated with specific harm categories. It is important to note that a single QA pair can be linked to more than one category. The dataset includes the following 14 harm categories:

  1. Animal Abuse
  2. Child Abuse
  3. Controversial Topics, Politics
  4. Discrimination, Stereotype, Injustice
  5. Drug Abuse, Weapons, Banned Substance
  6. Financial Crime, Property Crime, Theft
  7. Hate Speech, Offensive Language
  8. Misinformation Regarding Ethics, Laws, and Safety
  9. Non-Violent Unethical Behavior
  10. Privacy Violation
  11. Self-Harm
  12. Sexually Explicit, Adult Content
  13. Terrorism, Organized Crime
  14. Violence, Aiding and Abetting, Incitement
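Since a single QA pair can carry several of these labels at once, a record is naturally a multi-label structure: one boolean per harm category plus an overall safety flag. The sketch below illustrates that shape; the field names and category keys are illustrative assumptions, not necessarily the released dataset's exact schema.

```python
# Illustrative multi-label record shape for a classified QA pair.
# Category keys and field names are assumptions for this sketch only.
HARM_CATEGORIES = [
    "animal_abuse", "child_abuse", "controversial_topics_politics",
    "discrimination_stereotype_injustice",
    "drug_abuse_weapons_banned_substance",
    "financial_crime_property_crime_theft",
    "hate_speech_offensive_language",
    "misinformation_regarding_ethics_laws_and_safety",
    "non_violent_unethical_behavior", "privacy_violation", "self_harm",
    "sexually_explicit_adult_content", "terrorism_organized_crime",
    "violence_aiding_and_abetting_incitement",
]

def make_record(prompt: str, response: str, flagged: set) -> dict:
    """Build a record with one boolean flag per harm category."""
    return {
        "prompt": prompt,
        "response": response,
        "category": {c: (c in flagged) for c in HARM_CATEGORIES},
        "is_safe": not flagged,  # safe only when no category is flagged
    }

# A single QA pair can be linked to more than one category:
rec = make_record("...", "...",
                  {"privacy_violation", "financial_crime_property_crime_theft"})
assert sum(rec["category"].values()) == 2 and not rec["is_safe"]
```

Keeping the label as a per-category boolean dict (rather than a single class) is what makes the multi-label nature of the annotation explicit.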

The distribution of these 14 categories within the dataset is visualized in the following figure:

<div align="center"> <img src="images/dataset-distribution.png" width="85%"/> </div>

For more information and access to the data, please refer to the Hugging Face dataset PKU-Alignment/BeaverTails.

Preference Dataset

The preference dataset consists of 300k+ expert comparison data. Each entry in this dataset includes two responses to a question, along with safety meta-labels and preferences for both responses, taking into consideration their helpfulness and harmlessness.
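A comparison entry as described above pairs two responses to the same question with per-response safety meta-labels and separate preferences for helpfulness and harmlessness. The sketch below shows one plausible shape for such an entry; the field names are assumptions for illustration, not a guaranteed match to the released schema.

```python
# One hypothetical comparison entry: two responses, a safety meta-label
# for each, and separate helpfulness/harmlessness preferences.
entry = {
    "prompt": "How do I secure my home Wi-Fi?",
    "response_0": "Use WPA3 and a strong passphrase.",
    "response_1": "Leave it open so guests can connect.",
    "is_response_0_safe": True,
    "is_response_1_safe": False,
    "better_response_id": 0,   # preferred for helpfulness
    "safer_response_id": 0,    # preferred for harmlessness
}

def preferred_response(e: dict, criterion: str = "helpfulness") -> str:
    """Return the text of the preferred response under the given criterion."""
    key = "better_response_id" if criterion == "helpfulness" else "safer_response_id"
    return e[f"response_{e[key]}"]

assert preferred_response(entry) == entry["response_0"]
```

Separating the two preference axes lets reward modeling trade off helpfulness against harmlessness explicitly instead of collapsing them into one score.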

The annotation pipeline for this dataset is depicted in the following image:

<div align="center"> <img src="images/annotation-pipeline.png" width="85%"/> </div>

For more information and access to the data, please refer to the Hugging Face dataset PKU-Alignment/PKU-SafeRLHF.

Evaluation Dataset

Our evaluation dataset consists of 700 carefully crafted prompts spanning the 14 harm categories, with 50 prompts per category. It provides a comprehensive set of prompts for testing: researchers can use them to generate outputs from their own models, such as GPT-4 responses, and evaluate their performance.
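The balance works out because 14 categories times 50 prompts each gives exactly 700. A minimal sketch of assembling such a category-balanced set (with a hypothetical `build_eval_set` helper and synthetic prompt pools) looks like this:

```python
# Sketch: draw a fixed quota of prompts per harm category so that
# 14 categories * 50 prompts = 700 prompts total.
from collections import Counter

NUM_CATEGORIES = 14
PROMPTS_PER_CATEGORY = 50

def build_eval_set(prompts_by_category: dict) -> list:
    """Take the first 50 prompts from each category (hypothetical helper)."""
    picked = []
    for cat, prompts in prompts_by_category.items():
        picked.extend((cat, p) for p in prompts[:PROMPTS_PER_CATEGORY])
    return picked

# Synthetic pools standing in for the real per-category prompt sources.
pool = {f"category_{i}": [f"prompt_{i}_{j}" for j in range(60)]
        for i in range(NUM_CATEGORIES)}
eval_set = build_eval_set(pool)
assert len(eval_set) == 700
assert Counter(c for c, _ in eval_set)["category_0"] == 50
```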

For more information and access to the data, please refer to the Hugging Face dataset PKU-Alignment/BeaverTails-Evaluation.

How to Use BeaverTails Datasets

Train a QA-Moderation Model to Judge QA Pairs

Our 🤗 Hugging Face BeaverTails dataset can be used to train a QA-Moderation model to judge QA pairs:

<div align="center"> <img src="images/moderation.png" width="90%"/> </div>

In this paradigm, a QA pair is labeled as harmful or harmless based on its degree of risk neutrality, that is, the extent to which the potential risks in a potentially harmful question can be mitigated by a benign response.
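The labeling rule above can be sketched as follows: because the pair is judged on risk neutrality, a risky question paired with a benign, risk-mitigating response can still be labeled harmless. In this toy version, `response_flags` stands in for a classifier's per-category judgments of the response in context (an assumption of this sketch, not the repository's actual moderation code).

```python
# Toy version of the QA-moderation labeling rule: the pair is harmful
# only if the response itself triggers some harm category, even when
# the question is risky (risk neutrality).
def judge_qa_pair(response_flags: dict) -> str:
    """Label a QA pair by whether the response triggers any harm category."""
    return "harmful" if any(response_flags.values()) else "harmless"

# Risky question, but the response declines and mitigates -> harmless.
assert judge_qa_pair({"privacy_violation": False, "self_harm": False}) == "harmless"
# Response that itself leaks private data -> harmful.
assert judge_qa_pair({"privacy_violation": True, "self_harm": False}) == "harmful"
```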

In our examples directory, we provide our training and evaluation code for the QA-Moderation model. We also provide the trained weights of our QA-Moderation model on Hugging Face: PKU-Alignment/beaver-dam-7b.

Train a Helpful and Harmless Assistant

Through
