DuelNLG
This repository contains code for evaluating NLG Models as described in the following paper:
Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons
Akash Kumar Mohankumar, Mitesh M. Khapra. Association for Computational Linguistics (ACL), 2022
Installation
From Source:
git clone https://github.com/akashkm99/duelnlg.git
cd duelnlg
pip install -e .
To use automatic metrics, you may also need to download nlgeval data:
python ./scripts/download/nlg-eval --setup
Experiments from Paper
Here, we describe the steps to replicate the experiments mentioned in the paper.
Download and Prepare Data
To download and preprocess the WMT 2016 datasets:
bash scripts/preprocess/wmt16.sh
All the processed data will be stored as .pkl files at data/wmt16/processed/
For the WMT 2015 datasets:
bash scripts/preprocess/wmt15.sh
Model Free Algorithms
To perform experiments with model-free dueling bandits algorithms, use the duelnlg/duelpy/experiments/experiments.py script. It has the following arguments:
- `--feedback-config`: A JSON config that specifies the list of datasets and their parameters. Use `configs/feedback/wmt_all.json` to run on all 7 WMT datasets.
- `--algorithm-config`: Config file that specifies the dueling bandit algorithms and their parameters. Use `configs/algorithm/rmed.json` to run the RMED algorithm, and refer to `configs/algorithm/default_all.json` for the default parameters of all algorithms.
- `--output-dir`: Directory to save the results. (Default: `./results/bandits`)
- `--num-runs`: The number of times each algorithm is run with a different random seed. (Default: 200)
- `--random-seed`: The base random seed to use. (Default: 42)
For example, to run all the dueling bandit algorithms (except IF and PL, which are quite slow) on the WMT 2016 tur->eng dataset with 50 runs, use:
python duelnlg/duelpy/experiments/experiments.py \
--feedback-config ./configs/feedback/wmt16_tur_eng.json \
--algorithm-config ./configs/algorithm/default_all_no_if_pl.json \
--num-runs 50
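For intuition, a model-free dueling bandit algorithm repeatedly samples a pair of systems, queries a pairwise preference, and maintains win statistics until a winner emerges. The following is a simplified sketch with illustrative names, not the repo's `duelpy` API; it uses uniform pair sampling and an empirical Copeland winner:

```python
import random
from collections import defaultdict

def top_system(pref, systems, budget, seed=0):
    """Pick a winner from `budget` simulated pairwise duels.

    pref[(a, b)] is the probability that system a beats system b
    (a stand-in for a human annotator's preference). Returns the
    system with the best empirical Copeland score, i.e. the one
    that beats the most rivals in more than half of their duels."""
    rng = random.Random(seed)
    wins = defaultdict(int)   # (winner, loser) -> win count
    games = defaultdict(int)  # (a, b) -> duels played (both orders)
    for _ in range(budget):
        a, b = rng.sample(systems, 2)
        winner, loser = (a, b) if rng.random() < pref[(a, b)] else (b, a)
        wins[(winner, loser)] += 1
        games[(a, b)] += 1
        games[(b, a)] += 1

    def copeland(s):
        return sum(
            1
            for t in systems
            if t != s and games[(s, t)] > 0
            and wins[(s, t)] / games[(s, t)] > 0.5
        )

    return max(systems, key=copeland)
```

Algorithms like RMED differ mainly in how the next pair is chosen, which is what makes them far more sample-efficient than uniform sampling.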
Model Based Algorithms
1. Download Training and Validation Data
To use direct evaluation metrics, we need to tune a few hyperparameters (e.g. thresholds for the preference probabilities) on a validation set. For training any end-to-end metric for pairwise prediction, we would also require a training set.
To create the train and validation datasets for WMT, we use data from WMT 2013 and 2014:
bash scripts/prepare_train_val/wmt.sh
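The threshold tuning described above can be sketched as follows (a hypothetical illustration, not the repo's code): a metric's score difference is mapped to a win/tie/loss prediction, and the tie threshold is chosen to maximise pairwise accuracy on the validation pairs:

```python
def predict(score_a, score_b, tau):
    """Map a metric score difference to a pairwise label:
    1 = A wins, -1 = B wins, 0 = tie (difference within tau)."""
    diff = score_a - score_b
    if diff > tau:
        return 1
    if diff < -tau:
        return -1
    return 0

def tune_threshold(val_pairs, candidates):
    """Choose the tie threshold maximising pairwise accuracy on
    validation triples of (score_a, score_b, human_label)."""
    def accuracy(tau):
        hits = sum(predict(sa, sb, tau) == y for sa, sb, y in val_pairs)
        return hits / len(val_pairs)
    return max(candidates, key=accuracy)
```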
2. Automatic Evaluation Metrics
To run the Bleurt model, you need to download the model checkpoint:
bash scripts/download/bleurt_ckpt.sh
To run automatic metrics and save the predictions, use the duelnlg/direct_eval/evaluation.py script. It has the following arguments:
- `--metrics-config`: A JSON config that specifies the list of automatic metrics and their parameters. Use `configs/metrics/bleurt.json` to use Bleurt, and refer to `configs/metrics/all.json` to run all metrics.
- `--val-path` and `--test-path`: CSV files with the validation (for tuning) and test datasets. E.g., for WMT 2016, these are `./data/wmt13_14/processed/val.csv` and `data/wmt16/processed/wmt16-human-judgements.csv` respectively.
- `--processed-dir`: Directory with the processed .pkl files. E.g., for WMT 2016, it's `data/wmt16/processed`.
- `--ensemble`: Whether to perform multiple model forward passes with dropout for uncertainty estimation. Applicable only to Bleurt. (Default: False)
- `--multiref`: Whether the dataset has multiple reference texts. (Default: True)
For example, to run the Bleurt metric on WMT 2016 datasets, use the following:
python duelnlg/direct_eval/evaluation.py \
--metrics ./configs/metrics/bleurt.json \
--val-path ./data/wmt13_14/processed/val.csv \
--test-path ./data/wmt16/processed/wmt16-human-judgements.csv \
--output-results ./results/metrics/bleurt.csv \
--processed-dir ./data/wmt16/processed
Note:

- Use GPUs to speed up the evaluation. If GPUs are not being used, check your TensorFlow version and CUDA compatibility (install a TF (>2.0) version that supports your CUDA version).
- To accelerate evaluation with Google Cloud TPUs, refer to `configs/metrics/bleurt_tpu.json`. You just need to provide information about your storage bucket and TPU.
Uncertainty Estimation:
To compute the uncertainty in the Bleurt scores (required for the Uncertainty-aware Selection and UCB Elimination algorithms), use the following:
python duelnlg/direct_eval/evaluation.py \
--metrics ./configs/metrics/bleurt_ensemble.json \
--val-path ./data/wmt13_14/processed/val_1k.csv \
--test-path ./data/wmt16/processed/wmt16-human-judgements.csv \
--output-results ./results/metrics/bleurt_ensemble.csv \
--processed-dir ./data/wmt16/processed \
--ensemble
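Conceptually, the `--ensemble` flag runs several dropout forward passes and measures how much they disagree. The sketch below uses illustrative names and a simple disagreement proxy (variance of the pairwise votes) rather than the exact BALD computation:

```python
import statistics

def ensemble_preference(scores_a, scores_b, tau=0.0):
    """Aggregate K stochastic (dropout) forward passes.

    scores_a[k] and scores_b[k] are the metric's scores for the two
    outputs under the k-th pass. Returns the mean pairwise vote
    (an estimate of P(A beats B)) and a disagreement-based
    uncertainty: 0 when all passes agree, maximal (0.25) when
    they split evenly."""
    votes = [1 if sa - sb > tau else 0 for sa, sb in zip(scores_a, scores_b)]
    p = statistics.mean(votes)
    return p, p * (1 - p)
```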
3. Model Based Dueling Bandits
Once you've computed the automatic metric predictions, you can run model-based algorithms by simply adding a --model-config flag to the duelnlg/duelpy/experiments/experiments.py script.
For example, to perform random mixing with Bleurt using RMED on the WMT 2016 tur->eng dataset, use:
python duelnlg/duelpy/experiments/experiments.py \
--model-config ./configs/models/random_mixing_bleurt.json \
--feedback-config ./configs/feedback/wmt16_tur_eng.json \
--algorithm-config ./configs/algorithm/rmed.json \
--num-runs 200
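Random mixing itself is straightforward: each duel is answered by the automatic metric with some mixing probability and by the (costly) human annotator otherwise, reducing the number of human annotations needed. A hypothetical sketch, not the repo's implementation:

```python
import random

def mixed_feedback(human_label, model_label, mix_prob, rng):
    """Answer one duel. With probability mix_prob use the automatic
    metric's prediction (free); otherwise query the human (cost 1).
    Returns (label, human_annotations_spent)."""
    if rng.random() < mix_prob:
        return model_label, 0
    return human_label, 1

# Over many duels, only roughly (1 - mix_prob) of them cost a
# human annotation.
rng = random.Random(0)
cost = sum(mixed_feedback(1, 1, 0.7, rng)[1] for _ in range(1000))
```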
For other model-based algorithms, you can use the following model configs:
| Algorithm | Config |
|------------------------------------|-----------------------------------------------------------|
| Random Mixing | ./configs/models/random_mixing_bleurt.json |
| Uncertainty-aware Selection (BALD) | ./configs/models/uncertainity_bleurt.json |
| UCB Elimination | ./configs/models/ucb_elimination_bleurt.json |
| Uncertainty + UCB Elimination | ./configs/models/uncertainity_ucb_elimination_bleurt.json |
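For intuition, UCB elimination retains only systems whose upper confidence bound on the win rate is still competitive with the best system's lower bound, so human comparisons are not wasted on clearly inferior systems. A simplified sketch using a Hoeffding-style confidence radius (illustrative names, not the repo's code):

```python
import math

def ucb_eliminate(stats, delta=0.05):
    """Keep systems whose upper confidence bound on the win rate
    reaches the best lower bound; eliminate the rest.

    stats maps system -> (empirical_win_rate, num_duels)."""
    def bounds(rate, n):
        if n == 0:
            return 0.0, 1.0
        slack = math.sqrt(math.log(2 / delta) / (2 * n))  # Hoeffding radius
        return rate - slack, rate + slack

    lower = {s: bounds(r, n)[0] for s, (r, n) in stats.items()}
    upper = {s: bounds(r, n)[1] for s, (r, n) in stats.items()}
    best_lower = max(lower.values())
    return {s for s in stats if upper[s] >= best_lower}
```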