PLMCalibration
Code and data for the ACL 2023 paper "A Close Look into the Calibration of Pre-trained Language Models"
Installation
pip install -r requirements.txt
You may also run the code with your own versions of the libraries, but this can cause bugs.
Data Preparation
You need to download the datasets from Google Drive [download] and place the TextClassification folder in the ./datasets directory. All the datasets used in the paper can then be found in ./datasets/TextClassification .
Experiments
Question 1: Do PLMs learn to become calibrated in the training process?
To answer Question 1, we conduct fine-grained experiments to study how PLMs' calibration performance changes during training, considering dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pre-training. Since dataset difficulty is determined by the datasets we choose, we conduct separate experiments only for the other factors to reveal their effects.
Available training samples
Run:
python prompt-shots.py --model_name MODEL_NAME --dataset_name DATASET_NAME --repeats REPEATS --shots SHOTS
By default, the shot number will gradually increase until exceeding the size of the dataset, and the number of repetitions will automatically adjust according to the shot number. The results (probabilities, predictions and gold labels) will be recorded to ./results/shots.
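The exact schedule is defined in prompt-shots.py, but the default behavior described above can be sketched roughly as follows (the starting shot count and the doubling step are assumptions for illustration, not the repository's exact values):

```python
def shot_schedule(dataset_size, start=16):
    """Double the shot count until it would exceed the dataset size.
    `start=16` is an illustrative assumption, not the script's default."""
    shots, k = [], start
    while k <= dataset_size:
        shots.append(k)
        k *= 2
    return shots

# e.g. a dataset with 100 training examples yields the schedule [16, 32, 64]
schedule = shot_schedule(100)
```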
Training steps
Run:
python prompt-dynamics.py --model_name MODEL_NAME --dataset_name DATASET_NAME
The results of every 100 steps will be recorded to ./results/dynamics.
Number of tunable parameters
We consider two kinds of delta-tuning methods, i.e. Adapter and Soft Prompt-tuning.
For Adapter, run:
python prompt-delta.py --model_name MODEL_NAME --dataset_name DATASET_NAME --method adapter --parameter PARAMETER
For Soft Prompt, run:
python prompt-delta.py --model_name MODEL_NAME --dataset_name DATASET_NAME --method soft --parameter PARAMETER
The results will be recorded to ./results/delta-adapter and ./results/delta-soft respectively.
Model scale
Run:
python prompt-scale.py --model_name MODEL_NAME --dataset_name DATASET_NAME --scale SCALE
The results will be recorded to ./results/scale.
Pre-training
We consider the pre-trained PLM, a randomly initialized PLM, LSTM, BoW, and TF-IDF.
For PLM, run:
python prompt-pretrain.py --model_name MODEL_NAME --dataset_name DATASET_NAME --mode MODE
where MODE is pretrain or random.
For LSTM, run:
python train-lstm.py --dataset_name DATASET_NAME
For BoW and TF-IDF, run:
python train-bow-tf_idf.py --model_name MODEL_NAME --dataset_name DATASET_NAME
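The BoW and TF-IDF baselines can be sketched with scikit-learn; this is an illustrative sketch, not the repository's implementation, and the function name, feature settings, and example texts are all hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def make_baseline(model_name="bow"):
    """Hypothetical sketch of a BoW / TF-IDF classifier with probability outputs."""
    vec = CountVectorizer() if model_name == "bow" else TfidfVectorizer()
    return make_pipeline(vec, LogisticRegression(max_iter=1000))

# Toy data for illustration only
texts = ["great movie", "terrible movie", "great plot", "terrible plot"]
labels = [1, 0, 1, 0]

clf = make_baseline("tf_idf")
clf.fit(texts, labels)
# predict_proba gives the confidence scores used for calibration analysis
probs = clf.predict_proba(["great acting"])
```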
All the results will be recorded to ./results/pretrain.
Question 2: How effective are existing calibration methods?
To answer Question 2, we implement several calibration methods, including both unlearnable and learnable ones. To explore their performance under distribution shift, we consider various kinds of out-of-distribution (OOD) settings. All of the methods and OOD settings are implemented in one file, so you can simply run:
python prompt-ood.py --model_name MODEL_NAME --dataset_name DATASET_NAME --method METHOD
The OOD setting and calibration methods can be changed by different values of DATASET_NAME and METHOD, respectively. All the results will be recorded to ./results/ood.
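A common learnable calibration method is temperature scaling, which divides the logits by a scalar T fitted on a dev set. A minimal sketch using grid search over T instead of gradient-based fitting (the function names and the grid are illustrative, not the repository's implementation):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over the last axis."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 46)):
    """Pick the T that minimizes dev-set NLL (grid search for simplicity)."""
    nlls = []
    for T in grid:
        probs = softmax(logits, T)
        nlls.append(-np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean())
    return grid[int(np.argmin(nlls))]

# Toy dev set: confident and always correct, so sharpening (small T) lowers NLL
dev_logits = np.array([[2.0, 0.0], [0.0, 2.0]])
dev_labels = np.array([0, 1])
best_T = fit_temperature(dev_logits, dev_labels)
```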
Further, we vary the size of the calibration dataset as well as the scale of the backbone model to explore the emergent ability of learnable methods. Run:
python prompt-emergent.py --model_name MODEL_NAME --dataset_name DATASET_NAME --method METHOD --scale SCALE --dev_size DEV_SIZE
The results will be recorded to ./results/ood/t5-SCALE-DEV_SIZE, where SCALE and DEV_SIZE are the arguments passed on the command line.
Process Results
So far, we have obtained probabilities, predictions, and gold labels. Next, we use these results to compute the calibration metrics. Run:
python metric.py --setting_list SETTING_LIST --model_list MODEL_LIST --dataset_list DATASET_LIST
By passing SETTING_LIST, MODEL_LIST and DATASET_LIST, you can find the final metrics for all the experiments in the directory ./metrics.
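A standard calibration metric is expected calibration error (ECE), which bins predictions by confidence and compares each bin's accuracy against its mean confidence. A minimal sketch of how it can be computed from the recorded probabilities, predictions, and gold labels (the function name and binning details are illustrative, not necessarily what metric.py does):

```python
import numpy as np

def expected_calibration_error(confidences, predictions, labels, n_bins=10):
    """ECE: weighted average of |accuracy - confidence| over equal-width bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = (np.asarray(predictions) == np.asarray(labels)).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            # bin weight = fraction of samples, gap = |accuracy - mean confidence|
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

For example, two predictions made with confidence 0.6 of which only one is correct give an ECE of 0.1 (|0.5 accuracy − 0.6 confidence|).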
