# REDER

[NeurIPS 2021] *Duplex Sequence-to-Sequence Learning for Reversible Machine Translation*
## Update
- Dec 8, 2021: code cleansing and refactoring (not fully tested).

## TODO
- Fully test the code and flesh out this README when time permits.
## Requirements
Our model is built on [fairseq](https://github.com/pytorch/fairseq):
- fairseq==0.9.0
- pytorch==1.6.0
- [imputer-pytorch](https://github.com/rosinality/imputer-pytorch)
- [ctcdecode](https://github.com/parlance/ctcdecode.git)

Install with:
```bash
git clone https://github.com/zhengzx-nlp/REDER.git && cd REDER
bash nonauto/run/install.sh
```
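If `install.sh` does not work in your environment, the dependencies listed above can be installed by hand along these lines. This is only a sketch, not the supported path; in particular, note that `ctcdecode` vendors third-party code as git submodules and therefore needs a recursive clone:

```bash
# manual-install sketch; versions follow the pinned requirements above
pip install torch==1.6.0 fairseq==0.9.0

# imputer-pytorch provides the CTC/imputer ops used for training;
# if the repo ships a setup.py, pip install works, otherwise put its
# torch_imputer/ package on your PYTHONPATH
git clone https://github.com/rosinality/imputer-pytorch.git
pip install ./imputer-pytorch

# ctcdecode needs --recursive because of its bundled submodules
git clone --recursive https://github.com/parlance/ctcdecode.git
pip install ./ctcdecode
```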
## Training

### 1. Data Processing
We follow the standard data-processing procedure provided by the example scripts in fairseq, using `iwslt14.de-en` as a running example. The raw data is downloaded and tokenized by fairseq's `prepare-iwslt14.sh` script.

#### Download and prepare raw data
```bash
# fetch the preparation script from fairseq's examples
# (adjust the URL/branch to your fairseq checkout)
wget https://raw.githubusercontent.com/pytorch/fairseq/master/examples/translation/prepare-iwslt14.sh

# Download and prepare the data
bash prepare-iwslt14.sh

# Preprocess/binarize the data
TEXT=/path/to/iwslt14.tokenized.de-en
src=de
tgt=en
fairseq-preprocess --source-lang $src --target-lang $tgt \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir /path/to/data-bin/iwslt14.tokenized.de-en \
    --workers 20 --joined-dictionary
```
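Since `--joined-dictionary` builds a single vocabulary shared by both languages, the two emitted dictionary files should be identical. A quick check (a sketch, using the paths from above):

```bash
# both dictionaries come from the same joint vocabulary, so this prints nothing
diff /path/to/data-bin/iwslt14.tokenized.de-en/dict.de.txt \
     /path/to/data-bin/iwslt14.tokenized.de-en/dict.en.txt
```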
#### Training an AT model
```bash
export CUDA_VISIBLE_DEVICES=0

EXP_NAME="iwslt14.de-en.transformer"
mkdir -p $EXP_NAME && cd $EXP_NAME

fairseq-train \
    /path/to/data-bin/iwslt14.tokenized.de-en \
    -s $src -t $tgt \
    --arch transformer_iwslt_de_en --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 \
    --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --dropout 0.3 --weight-decay 0.0001 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --eval-bleu \
    --eval-bleu-args '{"beam": 5, "max_len_a": 1.2, "max_len_b": 10}' \
    --eval-bleu-detok moses \
    --eval-bleu-remove-bpe \
    --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --tensorboard-logdir "logs/$EXP_NAME"
```
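Once training converges, it is worth sanity-checking the AT model on the test set before using it as the KD teacher. A sketch, assuming fairseq's default save location `checkpoints/checkpoint_best.pt` under the working directory:

```bash
# decode the test set; the last line of output reports corpus BLEU
fairseq-generate /path/to/data-bin/iwslt14.tokenized.de-en \
    --gen-subset test -s $src -t $tgt \
    --path checkpoints/checkpoint_best.pt \
    --beam 5 --remove-bpe \
    | tail -n 1
```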
#### Create sequence-level KD data using the AT model
Translate the entire training set with the AT model and use its translations, rather than the ground-truth targets, as the training data for the NAT model.
```bash
export CUDA_VISIBLE_DEVICES=0

data=/path/to/data-bin/iwslt14.tokenized.de-en
mkdir -p results

fairseq-generate ${data} --fp16 \
    --gen-subset train -s $src -t $tgt \
    --path checkpoints/checkpoint_best.pt \
    --batch-size 1024 --max-tokens 8192 --beam 4 --remove-bpe \
    > results/train.kd.gen
```
#### Extract plain texts
```bash
output=results/train.kd.gen
# S-* lines carry the (BPE-removed) sources, H-* lines carry score + hypothesis
grep ^S $output | cut -f2- > train.kd.$src
grep ^H $output | cut -f3- > train.kd.$tgt
```
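`fairseq-generate` shuffles the corpus for batching, but since the source and hypothesis lines are extracted from the same output file, the pairs stay aligned line by line. A quick consistency check (a sketch):

```bash
# both files must contain exactly the same number of lines
wc -l train.kd.$src train.kd.$tgt
# eyeball a few source/translation pairs
paste train.kd.$src train.kd.$tgt | head -n 3
```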
#### Process/binarize data to fairseq format
```bash
data=/path/to/iwslt14.tokenized.de-en
data_bin=/path/to/data-bin/iwslt14.tokenized.de-en
distil_data=/path/to/iwslt14.tokenized.distil.de-en

# re-apply BPE using the original code file
mkdir -p ${distil_data}
mv train.kd.$src train.kd.$tgt $distil_data
cp $data/code $data/valid.* $data/test.* ${distil_data}
cd $distil_data
subword-nmt apply-bpe -c code < train.kd.$src > train.$src && rm train.kd.$src
subword-nmt apply-bpe -c code < train.kd.$tgt > train.$tgt && rm train.kd.$tgt

# binarize, reusing the joint dictionaries built from the original data
fairseq-preprocess --source-lang $src --target-lang $tgt \
    --trainpref train --validpref valid --testpref test \
    --destdir /path/to/data-bin/iwslt14.tokenized.distil.de-en \
    --workers 20 \
    --srcdict $data_bin/dict.$src.txt \
    --tgtdict $data_bin/dict.$tgt.txt
```
#### Create bidirectional KD data
Do the same for the reverse direction (en-de), reusing the same joint vocabulary, and put the binarized files into the same data folder together with de-en (see the sketch below).
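Concretely, this amounts to repeating the AT-teacher training, KD extraction, and binarization steps with the language pair swapped. A condensed sketch, assuming an en-de teacher has been trained and distilled exactly as above and `$data_bin` is set as in the previous block:

```bash
# swap the direction; the joint vocabulary is shared across both
src=en
tgt=de

# ... train an en-de AT teacher and extract/BPE its KD data as above ...

# binarize into the SAME destination folder so both directions live together
fairseq-preprocess --source-lang $src --target-lang $tgt \
    --trainpref train --validpref valid --testpref test \
    --destdir /path/to/data-bin/iwslt14.tokenized.distil.de-en \
    --workers 20 \
    --srcdict $data_bin/dict.$src.txt \
    --tgtdict $data_bin/dict.$tgt.txt
```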
After these steps, the data folder should have a structure like this:
```
/path/to/data-bin/iwslt14.tokenized.distil.de-en
├── dict.de.txt
├── dict.en.txt
├── preprocess.log
├── train.de-en.de.bin
├── train.de-en.de.idx
├── train.de-en.en.bin
├── train.de-en.en.idx
├── valid.de-en.de.bin
├── valid.de-en.de.idx
├── valid.de-en.en.bin
├── valid.de-en.en.idx
├── test.de-en.de.bin
├── test.de-en.de.idx
├── test.de-en.en.bin
├── test.de-en.en.idx
├── train.en-de.de.bin
├── train.en-de.de.idx
├── train.en-de.en.bin
├── train.en-de.en.idx
├── valid.en-de.de.bin
├── valid.en-de.de.idx
├── valid.en-de.en.bin
├── valid.en-de.en.idx
├── test.en-de.de.bin
├── test.en-de.de.idx
├── test.en-de.en.bin
└── test.en-de.en.idx
```
### 2. Training REDER
See `nonauto/run/train_REDER.sh`.

## Generation
See `nonauto/run/gen_REDER.sh`.
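Both scripts are driven by variables defined inside them (data paths, checkpoint locations, and model settings), so check each script's header before running. A minimal invocation sketch, assuming the distilled data folder from above has been configured in the scripts:

```bash
# train REDER on the bidirectional distilled data
bash nonauto/run/train_REDER.sh

# decode with a trained REDER checkpoint
bash nonauto/run/gen_REDER.sh
```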
## Example
Please check out the `experiments` folder for a complete, executable example on IWSLT14 En-De.
## Citation
```bibtex
@inproceedings{zheng2021REDER,
    title={Duplex Sequence-to-Sequence Learning for Reversible Machine Translation},
    author={Zheng, Zaixiang and Zhou, Hao and Huang, Shujian and Chen, Jiajun and Xu, Jingjing and Li, Lei},
    booktitle={NeurIPS},
    year={2021}
}
```