# SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge

Codes for our paper "SentiLARE: Sentiment-Aware Language Representation Learning with Linguistic Knowledge" (EMNLP 2020).
## Introduction
SentiLARE is a sentiment-aware pre-trained language model enhanced by linguistic knowledge. You can read our paper for more details. This project is a PyTorch implementation of our work.
## Dependencies
- Python 3
- NumPy
- Scikit-learn
- PyTorch >= 1.3.0
- PyTorch-Transformers (Huggingface) 1.2.0
- TensorboardX
- Sentence Transformers 0.2.6 (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)
- NLTK (Optional, used for linguistic knowledge acquisition during pre-training and fine-tuning)
## Quick Start for Fine-tuning
### Datasets of Downstream Tasks
Our experiments cover sentence-level sentiment classification (e.g., SST / MR / IMDB / Yelp-2 / Yelp-5) and aspect-level sentiment analysis (e.g., Lap14 / Res14 / Res16). You can download the pre-processed datasets (Google Drive / Tsinghua Cloud) of the downstream tasks. A detailed description of the data formats is included with the datasets.
### Fine-tuning
To quickly conduct the fine-tuning experiments, you can directly download the checkpoint (Google Drive / Tsinghua Cloud) of our pre-trained model. An example of fine-tuning SentiLARE on SST is shown below:
```shell
cd finetune
CUDA_VISIBLE_DEVICES=0,1,2 python run_sent_sentilr_roberta.py \
          --data_dir data/sent/sst \
          --model_type roberta \
          --model_name_or_path pretrain_model/ \
          --task_name sst \
          --do_train \
          --do_eval \
          --max_seq_length 256 \
          --per_gpu_train_batch_size 4 \
          --learning_rate 2e-5 \
          --num_train_epochs 3 \
          --output_dir sent_finetune/sst \
          --logging_steps 100 \
          --save_steps 100 \
          --warmup_steps 100 \
          --eval_all_checkpoints \
          --overwrite_output_dir
```
Note that `data_dir` is set to the directory of the pre-processed SST dataset, and `model_name_or_path` is set to the directory of the pre-trained model checkpoint. `output_dir` is the directory where the fine-tuning checkpoints are saved. Refer to the fine-tuning code for descriptions of the other hyper-parameters.
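If you only want to evaluate checkpoints that have already been fine-tuned, you can drop `--do_train`. A sketch of such an evaluation-only run (assuming the script follows the usual pytorch-transformers convention of loading checkpoints from `output_dir` for evaluation; all flags below appear in the training example above):

```shell
cd finetune
CUDA_VISIBLE_DEVICES=0 python run_sent_sentilr_roberta.py \
          --data_dir data/sent/sst \
          --model_type roberta \
          --model_name_or_path pretrain_model/ \
          --task_name sst \
          --do_eval \
          --max_seq_length 256 \
          --output_dir sent_finetune/sst \
          --eval_all_checkpoints
```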
More details about fine-tuning SentiLARE on other datasets can be found in `finetune/README.MD`.
### POS Tagging and Polarity Acquisition for Downstream Tasks
During pre-processing, we tokenize the original datasets with NLTK, tag the sentences with the Stanford Log-Linear Part-of-Speech Tagger, and obtain the sentiment polarity with Sentence-BERT.
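As an illustration of these three steps, here is a minimal sketch (not our actual pre-processing script: the tagger paths are placeholders for a local download of the Stanford POS Tagger, and `bert-base-nli-mean-tokens` stands in for whichever Sentence-BERT model is used):

```python
from nltk import word_tokenize                  # requires the NLTK 'punkt' data
from nltk.tag import StanfordPOSTagger
from sentence_transformers import SentenceTransformer

sentence = "The movie is surprisingly good."

# 1. Tokenization with NLTK.
tokens = word_tokenize(sentence)

# 2. Part-of-speech tagging with the Stanford Log-Linear POS Tagger
#    (the model and jar paths below are placeholders).
tagger = StanfordPOSTagger(
    "stanford-postagger/models/english-left3words-distsim.tagger",
    "stanford-postagger/stanford-postagger.jar",
)
pos_tags = tagger.tag(tokens)  # e.g. [('The', 'DT'), ('movie', 'NN'), ...]

# 3. Sentence representation with Sentence-BERT; word-level polarity is then
#    derived from such representations together with lexicon knowledge.
encoder = SentenceTransformer("bert-base-nli-mean-tokens")
embedding = encoder.encode([sentence])[0]
print(pos_tags, embedding.shape)
```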
## Pre-training
If you want to conduct pre-training yourself instead of directly using the checkpoint we provide, this section may help you pre-process the pre-training dataset and run the pre-training scripts.
### Dataset
We use the Yelp Dataset Challenge 2019 as our pre-training dataset. According to the Terms of Use of the Yelp dataset, you should download the dataset on your own.
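The raw reviews live in `review.json`, a JSON-lines file. A hedged sketch for streaming it (the `text` / `stars` field names match the Yelp dataset schema at the time of writing; verify them against the release you download):

```python
import json

# Placeholder path to the downloaded Yelp dataset.
with open("yelp_dataset/review.json", encoding="utf-8") as f:
    for line in f:
        review = json.loads(line)
        text, stars = review["text"], review["stars"]
        # ... feed (text, stars) into the pre-processing pipeline ...
        break  # remove this to process the whole file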
### POS Tagging and Polarity Acquisition for Pre-training Dataset
Similar to fine-tuning, we also conduct part-of-speech tagging and sentiment polarity acquisition on the pre-training dataset. Note that since the pre-training dataset is quite large, pre-processing may take a long time: Sentence-BERT has to compute the representation vectors of all the sentences in the dataset.
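If you run this step yourself, batching the Sentence-BERT encoding is the main lever for speed. A minimal sketch (`bert-base-nli-mean-tokens` and the batch size are placeholder choices, not our pinned configuration):

```python
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("bert-base-nli-mean-tokens")
# In practice this list holds every sentence of the pre-training corpus.
sentences = ["first review sentence", "second review sentence"]
embeddings = encoder.encode(sentences, batch_size=64)  # one vector per sentence
```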
### Pre-training
Refer to `pretrain/README.MD` for more implementation details about pre-training.
## Citation

```
@inproceedings{ke-etal-2020-sentilare,
    title = "{S}enti{LARE}: Sentiment-Aware Language Representation Learning with Linguistic Knowledge",
    author = "Ke, Pei and Ji, Haozhe and Liu, Siyang and Zhu, Xiaoyan and Huang, Minlie",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    pages = "6975--6988",
}
```
Please kindly cite our paper if you find the paper and the codes helpful.
## Thanks
Many thanks to the GitHub repositories of Transformers and BERT-PT. Part of our code is modified based on theirs.