🎊 PoliBERTweet: Language Models for Political Tweets
Transformer-based language models pre-trained on a large amount of politics-related Twitter data (83M tweets). This repo is the official resource of the paper "PoliBERTweet: A Pre-trained Language Model for Analyzing Political Content on Twitter" (LREC 2022).
📚 Data Sets
The data sets for the evaluation tasks presented in our paper are available below.
🚀 Pre-trained Models
All models are uploaded to our Hugging Face 🤗 page, so you can load a model with just three lines of code!
- PoliBERTweet (83M tweets) - Feel free to fine-tune this model for any downstream task 🎯
- PoliBERTweet-small (5M tweets)
⚙️ Usage
We tested our code with PyTorch v1.10.2 and transformers v4.18.0.
- To fine-tune our models for a specific task (e.g., stance detection), see the HuggingFace Doc; a sketch is also given in step 4 below.
- Please see the specific model pages above for more usage details. Below is a sample use case.
1. Load the model and tokenizer
from transformers import AutoModel, AutoTokenizer, pipeline
import torch
# Choose GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Select model path here
pretrained_LM_path = "kornosk/polibertweet-mlm"
# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(pretrained_LM_path)
model = AutoModel.from_pretrained(pretrained_LM_path)
model.to(device)  # move the model to the GPU if one is available
2. Predict the masked word
# Fill mask
example = "Trump is the <mask> of USA"
fill_mask = pipeline('fill-mask', model=pretrained_LM_path, tokenizer=tokenizer)
outputs = fill_mask(example)
print(outputs)
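Each prediction returned by the fill-mask pipeline is a dict containing the predicted token and its probability (this follows the standard transformers fill-mask output format):
# Print the top predicted tokens and their probabilities
for pred in outputs:
    print(pred["token_str"], round(pred["score"], 4))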
3. See embeddings
# See embeddings
inputs = tokenizer(example, return_tensors="pt").to(device)
with torch.no_grad():  # no gradients needed for inference
    outputs = model(**inputs)
print(outputs)  # outputs.last_hidden_state holds the per-token embeddings
# OR you can use this model to train on your downstream task!
# please consider citing our paper if you feel this is useful :)
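If you want a single vector per tweet rather than per-token embeddings, one common recipe is mean pooling over the attention mask. This is an illustrative sketch, not a method prescribed by the paper:
# Mean-pool the token embeddings into one sentence vector, ignoring padding
mask = inputs["attention_mask"].unsqueeze(-1)           # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # sum over tokens
tweet_embedding = summed / mask.sum(dim=1)              # (batch, hidden_size)
print(tweet_embedding.shape)                            # e.g. torch.Size([1, 768])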
4. Fine-tune to a downstream task like stance detection
See details in the HuggingFace Doc.
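For illustration, below is a minimal fine-tuning sketch built on the standard transformers Trainer API. The label set, the toy examples, and all hyperparameters are hypothetical placeholders, not the exact setup used in the paper.
from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer, TrainingArguments
import torch

pretrained_LM_path = "kornosk/polibertweet-mlm"
tokenizer = AutoTokenizer.from_pretrained(pretrained_LM_path)
# Hypothetical 3-way stance label set (e.g. AGAINST / FAVOR / NONE)
model = AutoModelForSequenceClassification.from_pretrained(pretrained_LM_path, num_labels=3)

# Toy training data -- replace with your own labeled tweets
texts = ["I will definitely vote for her!", "This policy is a disaster."]
labels = [1, 0]
enc = tokenizer(texts, truncation=True, padding=True)

# Minimal torch Dataset wrapping the tokenized tweets and their labels
class StanceDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings, self.labels = encodings, labels
    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item
    def __len__(self):
        return len(self.labels)

training_args = TrainingArguments(output_dir="polibertweet-stance", num_train_epochs=3, per_device_train_batch_size=8)
trainer = Trainer(model=model, args=training_args, train_dataset=StanceDataset(enc, labels))
trainer.train()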
✏️ Citation
If you feel our paper and resources are useful, please consider citing our work! 🙏
@inproceedings{kawintiranon2022polibertweet,
title = {{P}oli{BERT}weet: A Pre-trained Language Model for Analyzing Political Content on {T}witter},
author = {Kawintiranon, Kornraphop and Singh, Lisa},
booktitle = {Proceedings of the Language Resources and Evaluation Conference (LREC)},
year = {2022},
pages = {7360--7367},
publisher = {European Language Resources Association},
url = {https://aclanthology.org/2022.lrec-1.801}
}
🛠 Troubleshooting
Create an issue here if you run into any problems loading the models or data sets.
