DuelNLG
This repository contains code for evaluating NLG Models as described in the following paper:
Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons
Akash Kumar Mohankumar, Mitesh M. Khapra. Association for Computational Linguistics (ACL), 2022
Installation
From Source:
git clone https://github.com/akashkm99/duelnlg.git
cd duelnlg
pip install -e .
To use automatic metrics, you may also need to download nlgeval data:
python ./scripts/download/nlg-eval --setup
Experiments from Paper
Here, we describe the steps to replicate the experiments mentioned in the paper.
Download and Prepare Data
To download and preprocess the WMT 2016 datasets:
bash scripts/preprocess/wmt16.sh
All the processed data will be stored as .pkl files at data/wmt16/processed/
For the WMT 2015 datasets:
bash scripts/preprocess/wmt15.sh
Model Free Algorithms
To perform experiments with model-free dueling bandits algorithms, use the duelnlg/duelpy/experiments/experiments.py script. It has the following arguments:
- `--feedback-config`: A JSON config that specifies the list of datasets and their parameters. Use `configs/feedback/wmt_all.json` to run on all 7 WMT datasets.
- `--algorithm-config`: Config file that specifies the dueling bandit algorithms and their parameters. Use `configs/algorithm/rmed.json` to run the RMED algorithm, and refer to `configs/algorithm/default_all.json` for the default parameters of all algorithms.
- `--output-dir`: Directory to save the results. (Default: `./results/bandits`)
- `--num-runs`: The number of times each algorithm is run with a different random seed. (Default: 200)
- `--random-seed`: The base random seed to use. (Default: 42)
For example, to run all the dueling bandit algorithms (except IF and PL, which are quite slow) on the WMT 2016 tur->eng dataset with 50 runs, use:
python duelnlg/duelpy/experiments/experiments.py \
--feedback-config ./configs/feedback/wmt16_tur_eng.json \
--algorithm-config ./configs/algorithm/default_all_no_if_pl.json \
--num-runs 50
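For intuition, a model-free dueling bandit algorithm repeatedly samples a pair of systems, queries a pairwise preference, and maintains win statistics until a winner emerges. The following is a simplified sketch with illustrative names, not the repo's `duelpy` API; it uses uniform pair sampling and an empirical Copeland winner:

```python
import random
from collections import defaultdict

def top_system(pref, systems, budget, seed=0):
    """Pick a winner from `budget` simulated pairwise duels.

    pref[(a, b)] is the probability that system a beats system b
    (a stand-in for a human annotator's preference). Returns the
    system with the best empirical Copeland score, i.e. the one
    that beats the most rivals in more than half of their duels."""
    rng = random.Random(seed)
    wins = defaultdict(int)   # (winner, loser) -> win count
    games = defaultdict(int)  # (a, b) -> duels played (both orders)
    for _ in range(budget):
        a, b = rng.sample(systems, 2)
        winner, loser = (a, b) if rng.random() < pref[(a, b)] else (b, a)
        wins[(winner, loser)] += 1
        games[(a, b)] += 1
        games[(b, a)] += 1

    def copeland(s):
        return sum(
            1
            for t in systems
            if t != s and games[(s, t)] > 0
            and wins[(s, t)] / games[(s, t)] > 0.5
        )

    return max(systems, key=copeland)
```

Algorithms like RMED differ mainly in how the next pair is chosen, which is what makes them far more sample-efficient than uniform sampling.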
Model Based Algorithms
1. Download Training and Validation Data
To use direct evaluation metrics, we need to tune a few hyperparameters (e.g. thresholds for the preference probabilities) on a validation set. For training any end-to-end metric for pairwise prediction, we would also require a training set.
To create the train and validation datasets for WMT, we use data from WMT 2013 and 2014:
bash scripts/prepare_train_val/wmt.sh
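The threshold tuning described above can be sketched as follows (a hypothetical illustration, not the repo's code): a metric's score difference is mapped to a win/tie/loss prediction, and the tie threshold is chosen to maximise pairwise accuracy on the validation pairs:

```python
def predict(score_a, score_b, tau):
    """Map a metric score difference to a pairwise label:
    1 = A wins, -1 = B wins, 0 = tie (difference within tau)."""
    diff = score_a - score_b
    if diff > tau:
        return 1
    if diff < -tau:
        return -1
    return 0

def tune_threshold(val_pairs, candidates):
    """Choose the tie threshold maximising pairwise accuracy on
    validation triples of (score_a, score_b, human_label)."""
    def accuracy(tau):
        hits = sum(predict(sa, sb, tau) == y for sa, sb, y in val_pairs)
        return hits / len(val_pairs)
    return max(candidates, key=accuracy)
```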
2. Automatic Evaluation Metrics
To run the Bleurt model, you need to download the model checkpoint:
bash scripts/download/bleurt_ckpt.sh
To run automatic metrics and save the predictions, use the duelnlg/direct_eval/evaluation.py script. It has the following arguments:
- `--metrics-config`: A JSON config that specifies the list of automatic metrics and their parameters. Use `configs/metrics/bleurt.json` to use Bleurt, and refer to `configs/metrics/all.json` to run all metrics.
- `--val-path` and `--test-path`: CSV files with the validation (for tuning) and test datasets. E.g., for WMT 2016, these are `./data/wmt13_14/processed/val.csv` and `data/wmt16/processed/wmt16-human-judgements.csv` respectively.
- `--processed-dir`: Directory with the processed .pkl files. E.g., for WMT 2016, it's `data/wmt16/processed`.
- `--ensemble`: Whether to perform multiple model forward passes with dropout for uncertainty estimation. Applicable only to Bleurt. (Default: False)
- `--multiref`: Whether the dataset has multiple reference texts. (Default: True)
For example, to run the Bleurt metric on WMT 2016 datasets, use the following:
python duelnlg/direct_eval/evaluation.py \
--metrics ./configs/metrics/bleurt.json \
--val-path ./data/wmt13_14/processed/val.csv \
--test-path ./data/wmt16/processed/wmt16-human-judgements.csv \
--output-results ./results/metrics/bleurt.csv \
--processed-dir ./data/wmt16/processed
Note:

- Use GPUs to speed up the evaluation. If GPUs are not being used, check your TensorFlow version and CUDA compatibility (install a TF (>2.0) version that supports your CUDA version).
- To accelerate evaluation with Google Cloud TPUs, refer to `configs/metrics/bleurt_tpu.json`. You just need to provide information about your storage bucket and TPU.
Uncertainty Estimation:
To compute the uncertainty in the Bleurt scores (required for the Uncertainty-aware Selection and UCB Elimination algorithms), use the following:
python duelnlg/direct_eval/evaluation.py \
--metrics ./configs/metrics/bleurt_ensemble.json \
--val-path ./data/wmt13_14/processed/val_1k.csv \
--test-path ./data/wmt16/processed/wmt16-human-judgements.csv \
--output-results ./results/metrics/bleurt_ensemble.csv \
--processed-dir ./data/wmt16/processed \
--ensemble
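Conceptually, the `--ensemble` flag runs several dropout forward passes and measures how much they disagree. The sketch below uses illustrative names and a simple disagreement proxy (variance of the pairwise votes) rather than the exact BALD computation:

```python
import statistics

def ensemble_preference(scores_a, scores_b, tau=0.0):
    """Aggregate K stochastic (dropout) forward passes.

    scores_a[k] and scores_b[k] are the metric's scores for the two
    outputs under the k-th pass. Returns the mean pairwise vote
    (an estimate of P(A beats B)) and a disagreement-based
    uncertainty: 0 when all passes agree, maximal (0.25) when
    they split evenly."""
    votes = [1 if sa - sb > tau else 0 for sa, sb in zip(scores_a, scores_b)]
    p = statistics.mean(votes)
    return p, p * (1 - p)
```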
3. Model Based Dueling Bandits
Once you've computed the automatic metric predictions, you can run model-based algorithms by simply adding a --model-config flag to the duelnlg/duelpy/experiments/experiments.py script.
For example, to perform random mixing with Bleurt using RMED on the WMT 2016 tur->eng dataset, use:
python duelnlg/duelpy/experiments/experiments.py \
--model-config ./configs/models/random_mixing_bleurt.json \
--feedback-config ./configs/feedback/wmt16_tur_eng.json \
--algorithm-config ./configs/algorithm/rmed.json \
--num-runs 200
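Random mixing itself is straightforward: each duel is answered by the automatic metric with some mixing probability and by the (costly) human annotator otherwise, reducing the number of human annotations needed. A hypothetical sketch, not the repo's implementation:

```python
import random

def mixed_feedback(human_label, model_label, mix_prob, rng):
    """Answer one duel. With probability mix_prob use the automatic
    metric's prediction (free); otherwise query the human (cost 1).
    Returns (label, human_annotations_spent)."""
    if rng.random() < mix_prob:
        return model_label, 0
    return human_label, 1

# Over many duels, only roughly (1 - mix_prob) of them cost a
# human annotation.
rng = random.Random(0)
cost = sum(mixed_feedback(1, 1, 0.7, rng)[1] for _ in range(1000))
```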
For other model-based algorithms, you can use the following model configs:
| Algorithm | Config |
|------------------------------------|-----------------------------------------------------------|
| Random Mixing | ./configs/models/random_mixing_bleurt.json |
| Uncertainty-aware Selection (BALD) | ./configs/models/uncertainity_bleurt.json |
| UCB Elimination | ./configs/models/ucb_elimination_bleurt.json |
| Uncertainty + UCB Elimination | ./configs/models/uncertainity_ucb_elimination_bleurt.json |
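For intuition, UCB elimination retains only systems whose upper confidence bound on the win rate is still competitive with the best system's lower bound, so human comparisons are not wasted on clearly inferior systems. A simplified sketch using a Hoeffding-style confidence radius (illustrative names, not the repo's code):

```python
import math

def ucb_eliminate(stats, delta=0.05):
    """Keep systems whose upper confidence bound on the win rate
    reaches the best lower bound; eliminate the rest.

    stats maps system -> (empirical_win_rate, num_duels)."""
    def bounds(rate, n):
        if n == 0:
            return 0.0, 1.0
        slack = math.sqrt(math.log(2 / delta) / (2 * n))  # Hoeffding radius
        return rate - slack, rate + slack

    lower = {s: bounds(r, n)[0] for s, (r, n) in stats.items()}
    upper = {s: bounds(r, n)[1] for s, (r, n) in stats.items()}
    best_lower = max(lower.values())
    return {s for s in stats if upper[s] >= best_lower}
```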