FakeWhatsApp.Br

An annotated corpus of WhatsApp messages in Brazilian Portuguese (PT-BR) for automatic detection of textual misinformation.

An annotated corpus of anonymized WhatsApp messages from public groups in Brazilian Portuguese (PT-BR), built for automatic detection of textual misinformation and malicious users. For detailed information about the construction of the corpus and the experiments run on it, see our paper published at the ICEIS 2021 conference:

Cabral, Lucas, et al. "FakeWhatsApp.BR: NLP and machine learning techniques for misinformation detection in Brazilian Portuguese WhatsApp messages." Proceedings of the 23rd International Conference on Enterprise Information Systems, ICEIS. 2021.

If you use our corpus, please cite the corresponding paper. For further discussion and experiments, see my master's thesis (in Portuguese): https://repositorio.ufc.br/handle/riufc/63379

Data

The data collected during the 2018 Brazilian presidential elections is located at:

data/2018/fakeWhatsApp.BR_2018.csv

The data is stored in a CSV file, where each line is a message sent in a public group. The dictionary of variables is the following:

id: unique ID of a user
date: day of the year on which the message was sent
ddi: international dialing code (DDI)
country: country assigned to the ddi
country_iso3: ISO3 code of the country
ddd: Brazilian regional telephone area code (DDD)
state: Brazilian state
midia: boolean variable indicating whether the message is a media file (1) or not (0)
url: boolean variable indicating whether the message contains a URL (1) or not (0)
characters: number of characters in the message's text
words: number of words in the message's text
viral: boolean variable indicating whether another message with exactly the same text (and more than 5 words) appears in the corpus (1) or not (0). The viral messages are the ones that were manually labelled.
shares: number of times a message with exactly the same text appears in the corpus
text: textual content of the message
misinformation: manually assigned label indicating whether the message contains misinformation (1) or not (0). The value -1 means that the message was not labelled.
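
As a minimal sketch of how the variables above might be read, the snippet below uses Python's standard csv module to keep only the labelled rows, that is, those whose misinformation value is not -1. The function name load_labelled is hypothetical, and the snippet assumes a comma-separated file with a header row that matches the variable names listed above; neither assumption is confirmed by the repository.

```python
import csv
from typing import Dict, Iterable, List


def load_labelled(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return only the rows that carry a manual label (0 or 1).

    `lines` is any iterable of CSV lines, e.g. an open file handle.
    Assumption: the first line is a header containing `misinformation`.
    """
    labelled = []
    for row in csv.DictReader(lines):
        # -1 marks messages that were never labelled, so skip them.
        # float() first tolerates labels stored as "1.0" as well as "1".
        if int(float(row["misinformation"])) != -1:
            labelled.append(row)
    return labelled
```

It could be called with a file handle opened on data/2018/fakeWhatsApp.BR_2018.csv (for example with newline="" and a UTF-8 encoding, both of which are assumptions about the file). All values come back as strings, so numeric columns such as shares or words would still need converting before analysis.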

Notebooks:

1 - parser.ipynb Parses the data collected in WhatsApp groups, converting it from free-text format to structured data in a CSV table.
2 - labeling and anonymization.ipynb Transfers the labels manually annotated on the viral messages to the entire corpus and removes personal data, such as phone numbers, from the text.
3 - exploratory analysis.ipynb Exploration and visualization of the data set.
4 - compare corpora.ipynb Comparison with a Twitter fake news corpus to demonstrate the need for a corpus of WhatsApp texts.
5 - misinformation detection ml.ipynb Experiments with classical machine learning models to classify textual misinformation.
6 - deep learning char level cnn.ipynb Experiments with a character-level convolutional neural network to classify textual misinformation.
7 - user features.ipynb Exploiting user features to detect misinformation.
8 - user classification.ipynb Experiments on classifying users as superspreaders.
9 - automatic dataset expansion.ipynb Experiments with automatic expansion of the data set using cosine similarity.
10 - user credibility.ipynb Modeling user credibility to improve misinformation detection.
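
To illustrate the general idea behind expanding a data set with cosine similarity (notebook 9), here is a rough sketch that proposes a label for an unlabelled message by copying the label of its most similar labelled message. It is not the notebook's implementation: the bag-of-words representation, the tokenization, the function names, and the 0.8 threshold are all assumptions chosen for illustration, and the notebook may use a different vectorization entirely.

```python
import math
import re
from collections import Counter
from typing import List, Optional, Tuple


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count word tokens (an assumed, simple tokenization)."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty message is similar to nothing
    return dot / (norm_a * norm_b)


def propose_label(
    text: str,
    labelled: List[Tuple[str, int]],
    threshold: float = 0.8,  # hypothetical cut-off, not taken from the notebook
) -> Optional[int]:
    """Return the label of the most similar labelled message, or None if no
    labelled message reaches the similarity threshold."""
    query = bag_of_words(text)
    best_label, best_score = None, 0.0
    for known_text, label in labelled:
        score = cosine(query, bag_of_words(known_text))
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None
```

A high threshold keeps the propagated labels conservative: near-duplicates of labelled viral messages receive a label, while loosely related messages stay unlabelled (None) and can be left at -1.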
