HateBR
HateBR is the first large-scale, expert-annotated dataset of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.

HateBR
<div align="center">

| class         | label | total |
|---------------|-------|-------|
| offensive     | 1     | 3,500 |
| non-offensive | 0     | 3,500 |
| Total         |       | 7,000 |

</div>

HateBRXplain
<div align="center">

| class         | label | rationales                 | total |
|---------------|-------|----------------------------|-------|
| offensive     | 1     | human-annotated rationales | 3,500 |
| non-offensive | 0     | null                       | 3,500 |
| Total         |       |                            | 7,000 |

</div>
<br>

In addition, we also provide baseline machine learning results for both tasks: offensive language detection and hate speech detection. The best-performing models are available here as .pkl files. File names follow the pattern `[classification]_[representation]_[algorithm].pkl`, where the classification is `offensive` or `hate`, the representation is `ngram` or `tfidf`, and the algorithm is `nb`, `svm`, `mlp`, or `lr`. For example, the file `offensive_tfidf_svm.pkl` contains the offensive language detection model that uses a TF-IDF representation and the support vector machine algorithm.
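The naming convention described above can be handled programmatically. The following is a minimal sketch, not part of the released code: the helper names (`model_filename`, `parse_model_filename`, `load_model`) and the `models_dir` argument are hypothetical, and it assumes the .pkl files are standard Python pickles that can be read with `pickle.load`. If the models were saved with scikit-learn, a compatible scikit-learn version would presumably need to be installed before loading.

```python
import pickle
from pathlib import Path

# Components taken from the file naming convention in this README.
TASKS = {"offensive", "hate"}
REPRESENTATIONS = {"ngram", "tfidf"}
ALGORITHMS = {"nb", "svm", "mlp", "lr"}


def model_filename(task: str, representation: str, algorithm: str) -> str:
    """Build a file name such as 'offensive_tfidf_svm.pkl'."""
    if task not in TASKS:
        raise ValueError(f"unknown classification task: {task!r}")
    if representation not in REPRESENTATIONS:
        raise ValueError(f"unknown representation: {representation!r}")
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm!r}")
    return f"{task}_{representation}_{algorithm}.pkl"


def parse_model_filename(name: str) -> tuple:
    """Split 'offensive_tfidf_svm.pkl' into (task, representation, algorithm)."""
    parts = Path(name).stem.split("_")
    if len(parts) != 3:
        raise ValueError(f"unexpected file name: {name!r}")
    task, representation, algorithm = parts
    # Reuse the builder to validate each component.
    model_filename(task, representation, algorithm)
    return task, representation, algorithm


def load_model(models_dir: str, task: str, representation: str, algorithm: str):
    """Load one baseline model.

    Assumption: the file is a plain pickle stored under `models_dir`
    (a hypothetical local directory holding the downloaded .pkl files).
    """
    path = Path(models_dir) / model_filename(task, representation, algorithm)
    with path.open("rb") as fh:
        return pickle.load(fh)
```

For instance, `load_model("models", "offensive", "tfidf", "svm")` would look for `models/offensive_tfidf_svm.pkl`. Unpickling executes code stored in the file, so only load .pkl files obtained from a source you trust.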
Please cite our papers if you use our datasets:
@inproceedings{vargas-etal-2022-hatebr,
title = "{H}ate{BR}: A Large Expert Annotated Corpus of {B}razilian {I}nstagram Comments for Offensive Language and Hate Speech Detection",
author = "Vargas, Francielle and
Carvalho, Isabelle and
Rodrigues de G{\'o}es, Fabiana and
Pardo, Thiago and
Benevenuto, Fabr{\'\i}cio",
booktitle = "Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022)",
year = "2022",
address = "Marseille, France",
publisher = "European Language Resources Association",
url = "https://aclanthology.org/2022.lrec-1.777",
pages = "7174--7183",
}
@article{Vargas_Carvalho_Pardo_Benevenuto_2024,
author={Vargas, Francielle and Carvalho, Isabelle and Pardo, Thiago A. S. and Benevenuto, Fabrício},
title={Context-aware and expert data resources for Brazilian Portuguese hate speech detection},
DOI={10.1017/nlp.2024.18},
journal={Natural Language Processing},
year={2024},
pages={435--456},
volume={31},
number={2},
url={https://www.cambridge.org/core/journals/natural-language-processing/article/contextaware-and-expert-data-resources-for-brazilian-portuguese-hate-speech-detection/7D9019ED5471CD16E320EBED06A6E923#},
}
@inproceedings{salles-etal-2025-hatebrxplain,
title = "{H}ate{BRX}plain: A Benchmark Dataset with Human-Annotated Rationales for Explainable Hate Speech Detection in {B}razilian {P}ortuguese",
author = "Salles, Isadora and
Vargas, Francielle and
Benevenuto, Fabr{\'i}cio",
editor = "Rambow, Owen and
Wanner, Leo and
Apidianaki, Marianna and
Al-Khalifa, Hend and
Eugenio, Barbara Di and
Schockaert, Steven",
booktitle = "Proceedings of the 31st International Conference on Computational Linguistics (COLING 2025)",
year = "2025",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.coling-main.446/",
pages = "6659--6669",
}
<br>
License
This dataset is licensed under the Creative Commons Attribution-NonCommercial 4.0 International license (CC BY-NC 4.0).
<h2 align="left"> Ethics Statements </h2>
This dataset contains hateful and offensive content and is intended for research purposes only. Commercial use is not permitted.
<h2 align="left"> FUNDING </h2>
