Speculative Decoding

Introduction

This repository contains the code for reproducing our paper: Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation.

SpecDec

Download model

| Description | Model |
| ----------- | ------------------------------------------------------------ |
| wmt14.en-de | ar-verifier-base, nar-drafter-base (k=25) |
| wmt14.de-en | ar-verifier-base, nar-drafter-base (k=25) |
| wmt16.en-ro | ar-verifier-base, nar-drafter-base (k=25) |
| wmt16.ro-en | ar-verifier-base, nar-drafter-base (k=25) |

Requirements

Python >= 3.7
PyTorch >= 1.5.0

Installation

conda create -n specdec python=3.7
conda activate specdec
cd SpecDec
pip install --editable .

Preprocess

The datasets we used can be obtained by following the script released by Mask-Predict. We release the BPE codes and our dictionaries in ./data. The tokenized datasets are preprocessed as follows:

text=PATH_TO_YOUR_DATA
src=source_language
tgt=target_language
bin_path=PATH_TO_BIN_DIR

model_path=PATH_TO_MODEL_DICT_DIR

fairseq-preprocess --source-lang ${src} --target-lang ${tgt} \
    --trainpref $text/train --validpref $text/valid --testpref $text/test \
    --destdir ${bin_path} --workers 60 \
    --srcdict ${model_path}/dict.${src}.txt \
    --tgtdict ${model_path}/dict.${tgt}.txt

Encoder Initialization

We recommend using the AR verifier's encoder to initialize the weights of the NAR drafter. For preparing the initialization checkpoints, check encoder_initial.py.
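The idea of reusing the verifier's encoder can be illustrated with a minimal sketch. It is not the logic of encoder_initial.py; it only assumes that both models expose a flat state dict whose encoder parameters share the prefix `encoder.` (as fairseq Transformer checkpoints usually do under the `model` key). The function name `init_drafter_from_verifier` and the toy parameter names are hypothetical, and plain lists stand in for tensors.

```python
from typing import Any, Dict, List, Tuple


def init_drafter_from_verifier(
    verifier_state: Dict[str, Any],
    drafter_state: Dict[str, Any],
    prefix: str = "encoder.",
) -> Tuple[Dict[str, Any], List[str]]:
    """Copy encoder entries from the verifier into a copy of the drafter state.

    Hypothetical helper: only keys that start with `prefix` and already exist
    in the drafter are overwritten; everything else keeps the drafter's values.
    """
    merged = dict(drafter_state)
    copied = []
    for name, value in verifier_state.items():
        if name.startswith(prefix) and name in merged:
            merged[name] = value
            copied.append(name)
    return merged, copied


# Toy state dicts (lists stand in for weight tensors).
verifier = {"encoder.layers.0.fc1.weight": [1.0], "decoder.layers.0.fc1.weight": [2.0]}
drafter = {"encoder.layers.0.fc1.weight": [0.0], "decoder.layers.0.fc1.weight": [0.0]}

merged, copied = init_drafter_from_verifier(verifier, drafter)
print(copied)
```

With real checkpoints one would additionally load and save the files with torch and check that the copied tensors have matching shapes.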

Train

The AR verifier of SpecDec is a standard Transformer that can be trained with fairseq:

fairseq-train ${bin_path} --arch transformer --share-all-embeddings \
      --task translation --source-lang ${src} --target-lang ${tgt} \
      --criterion label_smoothed_cross_entropy --dropout ${dropout} \
      --label-smoothing 0.1 --lr ${lr} --clip-norm 3.0 \
      --warmup-init-lr 1e-7 --min-lr 1e-9 --lr-scheduler inverse_sqrt \
      --weight-decay 0.00001 --update-freq ${update_freq} --fp16 --seed ${seed} \
      --warmup-updates ${warmup} --optimizer adam \
      --adam-betas '(0.9, 0.98)' --max-tokens ${max_tokens} --max-epoch ${max_epoch} \
      --save-dir ./checkpoints \
      --eval-bleu \
      --eval-bleu-args '{"beam":5}' \
      --eval-bleu-detok moses \
      --eval-bleu-remove-bpe \
      --eval-bleu-print-samples \
      --best-checkpoint-metric bleu --maximize-best-checkpoint-metric

For training the NAR drafter of SpecDec (check train.sh):

python train.py ${bin_path} --arch block --noise block_mask --share-all-embeddings \
    --criterion glat_loss --label-smoothing 0.1 --lr ${lr} --warmup-init-lr 1e-7 \
    --stop-min-lr 1e-9 --lr-scheduler inverse_sqrt --warmup-updates ${warmup} \
    --optimizer adam --adam-betas '(0.9, 0.999)' --adam-eps 1e-6 \
    --task translation_lev_modified --max-tokens ${max_tokens} --weight-decay 0.01 \
    --dropout ${dropout} --encoder-layers 6 --encoder-embed-dim 512 --decoder-layers 6 \
    --decoder-embed-dim 512 --fp16 --max-source-positions 1000 \
    --max-target-positions 1000 --max-update ${update} --seed ${seed} --clip-norm 5 \
    --save-dir ./checkpoints --src-embedding-copy --log-interval 1000 \
    --user-dir specdec_plugins --block-size ${size} --total-up ${update} \
    --update-freq ${update_freq} --decoder-learned-pos --encoder-learned-pos \
    --apply-bert-init --activation-fn gelu \
    --restore-file ./checkpoints/initial_checkpoint.pt \
    --reset-optimizer --reset-meters --reset-lr-scheduler --reset-dataloader
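Both training commands use --lr-scheduler inverse_sqrt together with --warmup-init-lr and --warmup-updates. The following is a minimal sketch of what such a schedule computes, assuming it behaves like fairseq's inverse square root scheduler: a linear warmup from the initial rate to the peak rate, followed by decay proportional to the inverse square root of the update number. The function name `inverse_sqrt_lr` is hypothetical.

```python
def inverse_sqrt_lr(step: int, peak_lr: float, warmup_updates: int,
                    warmup_init_lr: float = 1e-7) -> float:
    """Learning rate at a given update (assumes warmup_updates > 0)."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr to peak_lr.
        return warmup_init_lr + step * (peak_lr - warmup_init_lr) / warmup_updates
    # Afterwards, decay with the inverse square root of the update number.
    return peak_lr * (warmup_updates / step) ** 0.5


# Example with lr=0.0005 and 10000 warmup updates.
for step in (0, 10000, 40000):
    print(step, inverse_sqrt_lr(step, peak_lr=0.0005, warmup_updates=10000))
```

Under these assumptions the rate peaks at the end of warmup and halves every time the update count quadruples.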

Hyperparameters

The hyperparameters of the NAR drafter are shown as follows:

| Hyperparameters \ Datasets | WMT14 EN-DE | WMT16 EN-RO |
| -------------------------- | :---------: | :---------: |
| learning rate | 0.0005 | 0.001 |
| dropout | 0.1 | 0.2 |
| warm up | 10000 | 4000 |
| max update | 300K | 50K |
| batch size (tokens) | 128K | 64K |

The effective batch size (in tokens) is computed as GPU_NUM * MAX_TOKENS * UPDATE_FREQ.
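As a worked example of this product, the sketch below uses one hypothetical combination of settings (8 GPUs, 4096 tokens per GPU, update frequency 4). These values are illustrative assumptions and not necessarily the ones used for the paper, and reading "128K" as 128 * 1024 is also an assumption.

```python
def effective_batch_tokens(gpu_num: int, max_tokens: int, update_freq: int) -> int:
    """Tokens contributing to one optimizer step across all GPUs."""
    return gpu_num * max_tokens * update_freq


# Hypothetical settings that would give a 128K-token batch.
tokens = effective_batch_tokens(gpu_num=8, max_tokens=4096, update_freq=4)
print(tokens)  # 131072
```

With fewer GPUs, the same effective batch size can be kept by raising UPDATE_FREQ proportionally.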

Inference

For SpecDec (see inference.sh; set beta=1 to obtain results identical to AR greedy decoding):

python inference.py ${data_dir} --path ${checkpoint_path} --user-dir specdec_plugins \
    --task translation_lev_modified --remove-bpe --max-sentences 20 \
    --source-lang ${src} --target-lang ${tgt} --iter-decode-max-iter 0 \
    --iter-decode-eos-penalty 0 --iter-decode-with-beam 1 --gen-subset test \
    --AR-path ${AR_checkpoint_path} --input-path ${input_path} \
    --output-path ${output_path} --block-size ${block_size} --beta ${beta} --tau ${tau} \
    --batch ${batch} --beam ${beam} --strategy ${strategy}

We measure the inference latency of SpecDec with a batch-size-1 implementation; see inference_paper.py for details.

See inference_drafter.py for inference with our NAR drafter only.
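To make the draft-then-verify idea concrete, here is a toy sketch of the loop with strict top-1 verification, which is why its output matches greedy AR decoding. It is not the code in inference.py: the drafter and verifier are stand-in functions over a fixed toy sequence, the verifier is called token by token instead of scoring the whole block in one parallel pass, and the relaxed verification controlled by --beta and --tau is not modeled. All names here are hypothetical.

```python
from typing import Callable, List, Tuple


def draft_and_verify(
    draft_block: Callable[[List[int], int], List[int]],
    verify_next: Callable[[List[int]], int],
    block_size: int,
    eos: int,
    max_len: int = 64,
) -> Tuple[List[int], List[int]]:
    """Return the decoded tokens and the number of tokens emitted per iteration."""
    output: List[int] = []
    emitted_per_iter: List[int] = []
    while len(output) < max_len and (not output or output[-1] != eos):
        drafted = draft_block(output, block_size)[:block_size]
        emitted = 0
        for token in drafted:
            target = verify_next(output)  # verifier's greedy choice
            output.append(target)         # always keep the verifier's token
            emitted += 1
            # Stop at the first mismatch, at EOS, or at the length limit.
            if target != token or target == eos or len(output) >= max_len:
                break
        if emitted == 0:
            # Empty draft: fall back to a single AR step.
            output.append(verify_next(output))
            emitted = 1
        emitted_per_iter.append(emitted)
    return output, emitted_per_iter


# Toy setup: the verifier always follows TARGET; the drafter errs at position 2.
TARGET = [5, 6, 7, 8, 9, 2]
EOS = 2


def toy_verifier(prefix: List[int]) -> int:
    return TARGET[len(prefix)]


def toy_drafter(prefix: List[int], k: int) -> List[int]:
    start = len(prefix)
    block = TARGET[start:start + k]
    return [0 if start + i == 2 else tok for i, tok in enumerate(block)]


out, steps = draft_and_verify(toy_drafter, toy_verifier, block_size=4, eos=EOS)
print(out, steps)
```

In this toy run the six tokens are produced in two iterations instead of six AR steps, and the drafter's mistake is replaced by the verifier's token.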

To compute compound-split BLEU:

./ref.sh

Example

We put the first three tokenized sentences of WMT14 EN-DE in data/wmt14.en-de/example.en. Use this file as the input_path of the inference script. The results below were obtained by running inference.sh with inference_paper.py (on 1 NVIDIA P100 GPU, PyTorch 1.10, CUDA 11).

| Model | Accepted Tokens (average) | Latency (s) |
| --------------- | :-----------------------: | :---------: |
| Fairseq (beam5) | 1.00 | 0.83 |
| Fairseq (beam1) | 1.00 | 0.81 |
| SpecDec | 6.18 | 0.27 |
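The speedup implied by these latencies is the ratio of the baseline latency to the SpecDec latency. The short sketch below computes it from the table values; the helper name `speedup` is hypothetical.

```python
# Latencies in seconds, copied from the table above.
latency = {"Fairseq (beam5)": 0.83, "Fairseq (beam1)": 0.81, "SpecDec": 0.27}


def speedup(baseline: str, system: str = "SpecDec") -> float:
    """Ratio of baseline latency to system latency, rounded to two decimals."""
    return round(latency[baseline] / latency[system], 2)


print(speedup("Fairseq (beam1)"))  # 3.0
print(speedup("Fairseq (beam5)"))  # 3.07
```

On this three-sentence example the ratio is about 3x over both baselines.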

You can find the translation results in ./output.

Extra Memory Cost

Since there is no need to save intermediate variables during inference, SpecDec can achieve 3x~5x decoding speedup (by alternating NAR and AR decoding) with only ~300MiB of extra memory cost. Below is the nvidia-smi memory cost comparison of AR and SpecDec, tested on WMT14 EN-DE:

| Model \ Batch Size | Model States (Params) | 1 | 4 | 8 | 16 | 32 |
| ------------------ | :-------------------: | :--: | :--: | :--: | :--: | :--: |
| Fairseq (beam1) | 232.38 | 1670 | 1712 | 1758 | 1844 | 2028 |
| SpecDec | 469.75 (AR + NAR) | 1902 | 1938 | 2012 | 2108 | 2298 |
| Extra Memory | 237.38 (NAR) | 232 | 226 | 254 | 264 | 270 |
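The Extra Memory row is the per-batch-size difference between the SpecDec and Fairseq rows. The sketch below recomputes it from the table values, assuming the figures are in MiB as reported by nvidia-smi.

```python
batch_sizes = [1, 4, 8, 16, 32]
fairseq_mem = [1670, 1712, 1758, 1844, 2028]  # Fairseq (beam1)
specdec_mem = [1902, 1938, 2012, 2108, 2298]  # SpecDec

# Extra memory of SpecDec over the AR baseline for each batch size.
extra = [s - f for s, f in zip(specdec_mem, fairseq_mem)]
for bs, e in zip(batch_sizes, extra):
    print(f"batch {bs}: +{e}")
```

The largest difference in this table is 270, which is consistent with the ~300MiB figure quoted above.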

Note

This code is based on GLAT (https://github.com/FLC777/GLAT).

Citation

If you find the resources in this repository useful, please cite our paper:

@inproceedings{xia-etal-2023-speculative,
    title = "Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation",
    author = "Xia, Heming  and
      Ge, Tao  and
      Wang, Peiyi  and
      Chen, Si-Qing  and
      Wei, Furu  and
      Sui, Zhifang",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.257",
    pages = "3909--3925",
}