FakeWhatsApp.Br
An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.
An annotated corpus of anonymized WhatsApp messages from public groups in PT-BR for automatic detection of textual misinformation and malicious users. For detailed information about the construction of the corpus and the experiments performed on it, see our paper published at the ICEIS 2021 conference:
Cabral, Lucas, et al. "Fakewhastapp. br: NLP and machine learning techniques for misinformation detection in brazilian portuguese whatsapp messages." Proceedings of the 23rd International Conference on Enterprise Information Systems, ICEIS. 2021.
If you use our corpus, please cite the corresponding paper. For further discussion and experiments, see my master's thesis (in Portuguese): https://repositorio.ufc.br/handle/riufc/63379
Data
The data collected during the 2018 Brazilian presidential elections is located at:
data/2018/fakeWhatsApp.BR_2018.csv
The data is stored in a CSV file, where each line is a message sent in a public group. The dictionary of variables is the following:
- id: unique ID of a user
- date: day of the year on which the message was sent
- ddi: international dialing code
- country: country assigned to the ddi
- country_iso3: ISO3 code of the country
- ddd: Brazilian regional telephone code
- state: Brazilian state
- midia: boolean variable indicating whether the message is a media file (1) or not (0)
- url: boolean variable indicating whether the message contains a URL (1) or not (0)
- characters: number of characters in the message text
- words: number of words in the message text
- viral: boolean variable indicating whether a message with exactly the same text and more than 5 words appears in the corpus (1) or not (0). The viral messages were the ones manually labelled.
- shares: number of times a message with exactly the same text appears in the corpus
- text: textual content of the message
- misinformation: manually assigned label indicating whether the message contains misinformation (1) or not (0). The value -1 means that the message was not labelled.
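As a minimal sketch of how the label column described above might be used, the following Python snippet reads rows with the standard `csv` module and separates labelled messages (0 or 1) from unlabelled ones (-1). It assumes the file has a header row with the column names listed above and is comma-delimited; neither is confirmed here. The helper names `read_messages` and `split_by_label` are hypothetical, and the sample rows are synthetic placeholders, not corpus data.

```python
import csv
import io
from collections import Counter

# Path given in this README; adjust it to where the repository is cloned.
CORPUS_PATH = "data/2018/fakeWhatsApp.BR_2018.csv"


def read_messages(handle):
    """Read rows from an open CSV handle and convert the label to int."""
    rows = []
    for row in csv.DictReader(handle):
        # int(float(...)) tolerates labels stored as "1" or "1.0".
        row["misinformation"] = int(float(row["misinformation"]))
        rows.append(row)
    return rows


def split_by_label(rows):
    """Separate labelled rows (0 or 1) from unlabelled rows (-1)."""
    labelled = [r for r in rows if r["misinformation"] in (0, 1)]
    unlabelled = [r for r in rows if r["misinformation"] == -1]
    return labelled, unlabelled


# Synthetic stand-in with only two of the documented columns.
# These rows are placeholders and do NOT come from the corpus.
SAMPLE = io.StringIO(
    "text,misinformation\n"
    "example message one,1\n"
    "example message two,0\n"
    "example message three,-1\n"
)

rows = read_messages(SAMPLE)
labelled, unlabelled = split_by_label(rows)
label_counts = Counter(r["misinformation"] for r in rows)
print(len(labelled), len(unlabelled), dict(label_counts))

# For the real file (the UTF-8 encoding is an assumption):
# with open(CORPUS_PATH, newline="", encoding="utf-8") as f:
#     rows = read_messages(f)
```

Keeping the -1 rows in a separate list makes it harder to treat unlabelled messages as negative examples by accident when training a classifier.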
Notebooks:
- 1 - parser.ipynb<br> This notebook parses the data collected in WhatsApp groups, converting it from free-text format to structured data in a CSV table.
- 2 - labeling and anonymization.ipynb<br> In this notebook we transfer the labels manually annotated on the viral messages to the entire corpus and remove personal data, such as phone numbers, present in the text.
- 3 - exploratory analysis.ipynb<br> Exploration and visualization of the data set.
- 4 - compare corpora.ipynb<br> Comparison with a fake news corpus from Twitter to demonstrate the need for a corpus of WhatsApp texts.
- 5 - misinformation detection ml.ipynb<br> Experiments with classical machine learning models to classify textual misinformation.
- 6 - deep learning char level cnn.ipynb<br> Experiments with a character-level convolutional neural network to classify textual misinformation.
- 7 - user features.ipynb<br> Exploiting user features to detect misinformation.
- 8 - user classification.ipynb<br> Experiments classifying users as superspreaders.
- 9 - automatic dataset expansion.ipynb<br> Experiments with automatic expansion of the dataset using cosine similarity.
- 10 - user credibility.ipynb<br> Modeling user credibility to improve misinformation detection.
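To illustrate the idea behind expanding a dataset with cosine similarity (notebook 9), here is a rough, standard-library-only sketch. It is not the notebook's implementation: the bag-of-words vectors, the nearest-neighbour rule, the `threshold=0.8` value, and the function names are all assumptions made for illustration, and the example texts are synthetic.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def expand_labels(labelled, unlabelled, threshold=0.8):
    """Copy the label of the most similar labelled text onto each
    unlabelled text, but only if the similarity reaches the threshold
    (a hypothetical value); otherwise keep -1 (not labelled)."""
    vectors = [(bag_of_words(text), label) for text, label in labelled]
    result = []
    for text in unlabelled:
        vec = bag_of_words(text)
        best_sim, best_label = 0.0, -1
        for labelled_vec, label in vectors:
            sim = cosine_similarity(vec, labelled_vec)
            if sim > best_sim:
                best_sim, best_label = sim, label
        result.append((text, best_label if best_sim >= threshold else -1))
    return result


# Synthetic examples, not taken from the corpus.
labelled = [
    ("urgent share this before it is deleted", 1),
    ("meeting tomorrow at nine in the office", 0),
]
unlabelled = [
    "urgent share this before it gets deleted",
    "happy birthday to you",
]

expanded = expand_labels(labelled, unlabelled)
for text, label in expanded:
    print(label, text)
```

A high threshold keeps the expansion conservative: near-duplicates of a labelled message inherit its label, while unrelated messages stay at -1. Real experiments would likely use a stronger text representation such as TF-IDF.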
