# HateBench
This is the official repository of the USENIX 2025 paper HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns.
In this paper, we propose HateBench, a framework designed to benchmark hate speech detectors on LLM-generated content.
Disclaimer. This repo contains examples of hateful and abusive language. Reader discretion is recommended. This repo is intended for research purposes only. Any misuse is strictly prohibited.
## Overview
Our artifact repository includes:
- HateBench, the framework designed to benchmark hate speech detectors on LLM-generated content.
- HateBenchSet, a manually annotated dataset of 7,838 LLM-generated samples spanning 34 identity groups.
- Code for reproducing the LLM-driven hate campaigns, including both the adversarial hate campaign and the stealthy hate campaign.
- Scripts to generate the key result tables and figures from the paper, including:
  - Table 3: Performance on LLM-generated samples.
  - Table 4: F1-score on LLM-generated and human-written samples.
  - Table 6: Performance of the adversarial hate campaign.
  - Table 8: Performance of model stealing attacks.
  - Table 9: Performance of the stealthy hate campaign with black-box attacks.
  - Table 10: Performance of the stealthy hate campaign with white-box gradient optimization.
## HateBench

### HateBenchSet
HateBenchSet is provided on Hugging Face.
```python
from datasets import load_dataset

dataset = load_dataset("TrustAIRLab/HateBenchSet", "default")
```
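For a quick sanity check after loading, the snippet below is a minimal sketch: it assumes a `train` split and integer 0/1 `hate_label` values, and prints the columns, the first record, and the label distribution.

```python
from datasets import load_dataset

dataset = load_dataset("TrustAIRLab/HateBenchSet", "default")
train = dataset["train"]  # assumed split name

print(train.column_names)  # expected: model, status, ..., hate_label
print(train[0])            # first sample as a dict

# Count Hate vs. Non-Hate samples (hate_label: 1 = Hate, 0 = Non-Hate).
n_hate = sum(train["hate_label"])
print(f"Hate: {n_hate}, Non-Hate: {len(train) - n_hate}")
```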
Data structure:
| Column          | Description                                                 |
| --------------- | ----------------------------------------------------------- |
| `model`         | Model used to generate the response.                        |
| `status`        | Status of the model, i.e., original or jailbreak.            |
| `status_prompt` | Prompt used to set the model.                               |
| `main_target`   | Category of the identity group, e.g., race, religion, etc.  |
| `sub_target`    | The identity group.                                         |
| `target_name`   | The complete name of the identity group.                    |
| `pid`           | Prompt ID.                                                  |
| `prompt`        | The prompt.                                                 |
| `text`          | The sample generated by the model.                          |
| `hate_label`    | 1 denotes Hate; 0 denotes Non-Hate.                         |
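As a hedged sketch of how these columns can be combined, the following computes the fraction of Hate-labeled samples per `main_target` category (again assuming a `train` split and integer 0/1 labels):

```python
from collections import defaultdict

from datasets import load_dataset

train = load_dataset("TrustAIRLab/HateBenchSet", "default")["train"]

# main_target -> [hate count, total count]
counts = defaultdict(lambda: [0, 0])
for row in train:
    counts[row["main_target"]][0] += row["hate_label"]
    counts[row["main_target"]][1] += 1

for target, (hate, total) in sorted(counts.items()):
    print(f"{target}: {hate}/{total} labeled Hate ({hate / total:.1%})")
```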
We also provide a labeled version of HateBenchSet, which augments the dataset with the predictions of the eight detectors evaluated in our paper.
Specifically, for each detector, the predictions are recorded in the following columns:

- `{detector}`: the complete record returned by the detector.
- `{detector}_score`: the hate score of the sample.
- `{detector}_flagged`: whether the sample is predicted as hate.
```python
from datasets import load_dataset

dataset = load_dataset("TrustAIRLab/HateBenchSet", "labeled")
```
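As an illustrative sketch, the snippet below scores a single detector against the manual `hate_label` annotations. The `perspective` prefix is a placeholder, not a confirmed column name: substitute one of the actual `{detector}` names found in the dataset's columns.

```python
from datasets import load_dataset

labeled = load_dataset("TrustAIRLab/HateBenchSet", "labeled")["train"]

detector = "perspective"  # hypothetical prefix; check labeled.column_names
y_true = labeled["hate_label"]
y_pred = [int(flag) for flag in labeled[f"{detector}_flagged"]]

# Precision / recall / F1 computed from the confusion-matrix counts.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(f"{detector}: precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```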
## LLM-Driven Hate Campaign
Given the ethical concerns, the code is provided on Zenodo with the request-access feature enabled. We manually review each applicant's information before approving access.
## Ethics & Disclosure
Our work relies on LLMs to generate samples, and all manual annotations were performed by the authors of this study. Our study is therefore not considered human subjects research by our Institutional Review Board (IRB). By doing the annotations ourselves, we also ensured that no human subjects were exposed to harmful information during the study.

Since our work involves the assessment of LLM-driven hate campaigns, it inevitably discloses how attackers can evade a hate speech detector. We have taken great care to share our findings responsibly. We disclosed the paper and the labeled dataset to OpenAI, Google Jigsaw, and the developers of the open-source detectors, and in our disclosure letter we explicitly highlighted the high attack success rates of the LLM-driven hate campaigns. We have received acknowledgments from OpenAI and Google Jigsaw.
This repo is intended for research purposes only. Any misuse is strictly prohibited.
## Citation
If you find this useful in your research, please consider citing:
```bibtex
@inproceedings{SWQBZZ25,
  author    = {Xinyue Shen and Yixin Wu and Yiting Qu and Michael Backes and Savvas Zannettou and Yang Zhang},
  title     = {{HateBench: Benchmarking Hate Speech Detectors on LLM-Generated Content and Hate Campaigns}},
  booktitle = {{USENIX Security Symposium (USENIX Security)}},
  publisher = {USENIX},
  year      = {2025}
}
```