XLM

NEW: Added XLM-R model.

PyTorch original implementation of Cross-lingual Language Model Pretraining. Includes:

Monolingual language model pretraining (BERT)
Cross-lingual language model pretraining (XLM)
Applications: Supervised / Unsupervised MT (NMT / UNMT)
Applications: Cross-lingual text classification (XNLI)
Product-Key Memory Layers (PKM)

Model

XLM supports multi-GPU and multi-node training, and contains code for:

Language model pretraining:
- Causal Language Model (CLM)
- Masked Language Model (MLM)
- Translation Language Model (TLM)
GLUE fine-tuning
XNLI fine-tuning
Supervised / Unsupervised MT training:
- Denoising auto-encoder
- Parallel data training
- Online back-translation
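As a rough illustration of the Translation Language Model (TLM) objective listed above, the sketch below builds one training input from a parallel sentence pair: the two sentences are concatenated into a single stream, positions restart for the second sentence, and every token carries a language id. The function name, the `</s>` separator, and the integer language ids are assumptions made for this example, not the repository's actual data loader.

```python
def build_tlm_input(src_tokens, tgt_tokens, src_lang_id, tgt_lang_id, sep="</s>"):
    """Concatenate a sentence pair into one stream with positions and language ids."""
    src = [sep] + src_tokens + [sep]
    tgt = [sep] + tgt_tokens + [sep]
    tokens = src + tgt
    # Positions restart at 0 for the second sentence.
    positions = list(range(len(src))) + list(range(len(tgt)))
    # Each token carries the id of the language it belongs to.
    langs = [src_lang_id] * len(src) + [tgt_lang_id] * len(tgt)
    return tokens, positions, langs


tokens, positions, langs = build_tlm_input(
    ["the", "cat"], ["le", "chat"], src_lang_id=0, tgt_lang_id=1
)
print(tokens)
print(positions)
print(langs)
```

Masking is then applied to words in both sentences, so the model can use the context in one language to predict a masked word in the other.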

Installation

Install the python package in editable mode with

pip install -e .

Dependencies

Python 3
NumPy
PyTorch (currently tested on versions 0.4 and 1.0)
fastBPE (generate and apply BPE codes)
Moses (scripts to clean and tokenize text only - no installation required)
Apex (for fp16 training)

I. Monolingual language model pretraining (BERT)

In what follows we explain how you can download and use our pretrained XLM (English-only) BERT model. Then we explain how you can train your own monolingual model, and how you can fine-tune it on the GLUE tasks.

Pretrained English model

We provide our pretrained XLM_en English model, trained with the MLM objective.

| Languages | Pretraining | Model | BPE codes | Vocabulary |
| --------- | ----------- |:-----:|:---------:| ----------:|
| English   | MLM         | Model | BPE codes | Vocabulary |

which obtains better performance than BERT (see the GLUE benchmark) while trained on the same data:

| Model  | Score | CoLA | SST2 | MRPC      | STS-B     | QQP       | MNLI_m | MNLI_mm | QNLI | RTE  | WNLI | AX   |
|:------:|:-----:|:----:|:----:|:---------:|:---------:|:---------:|:------:|:-------:|:----:|:----:|:----:|:----:|
| BERT   | 80.5  | 60.5 | 94.9 | 89.3/85.4 | 87.6/86.5 | 72.1/89.3 | 86.7   | 85.9    | 92.7 | 70.1 | 65.1 | 39.6 |
| XLM_en | 82.8  | 62.9 | 95.6 | 90.7/87.1 | 88.8/88.2 | 73.2/89.8 | 89.1   | 88.5    | 94.0 | 76.0 | 71.9 | 44.7 |

If you want to play around with the model and its representations, just download the model and take a look at our ipython notebook demo.

Our XLM PyTorch English model is trained on the same data as the pretrained BERT TensorFlow model (Wikipedia + Toronto Book Corpus). Our implementation does not use the next-sentence prediction task and has only 12 layers, but a higher capacity (665M parameters). Overall, our model outperforms the original BERT on all GLUE tasks (see the table above for comparison).
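To see where a figure of this size can come from, here is a back-of-the-envelope estimate. It uses the 12 layers stated above and the `--emb_dim 2048` value from the training configuration further down in this README. The rest is assumed for illustration: a feed-forward hidden size of 4 × `emb_dim`, a vocabulary of roughly 30,000 entries, token embeddings shared with the output layer, and no biases, layer norms, or position embeddings.

```python
def estimate_transformer_params(emb_dim, n_layers, vocab_size, ffn_mult=4):
    """Rough parameter count; ignores biases, layer norms and position embeddings."""
    # Query, key, value and output projections of self-attention.
    attention = 4 * emb_dim * emb_dim
    # Two linear maps of the feed-forward block: d -> ffn_mult*d -> d (assumed shape).
    feed_forward = 2 * emb_dim * (ffn_mult * emb_dim)
    per_layer = attention + feed_forward
    # Token embeddings, assumed to be shared with the output projection.
    embeddings = vocab_size * emb_dim
    return n_layers * per_layer + embeddings


total = estimate_transformer_params(emb_dim=2048, n_layers=12, vocab_size=30000)
print(f"{total / 1e6:.0f}M parameters")  # prints "665M parameters"
```

Under these assumptions the estimate lands close to the reported 665M. The exact count depends on the real vocabulary size and on the terms ignored here.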

Train your own monolingual BERT model

In what follows, we explain how you can train a similar model on your own data.

1. Preparing the data

First, get the monolingual data. Only English Wikipedia is used here, because the Toronto Book Corpus (TBC) is no longer hosted.

# Download and tokenize Wikipedia data in 'data/wiki/txt/en.{train,valid,test}'
# Note: the tokenization includes lower-casing and accent-removal
./get-data-wiki.sh en

Install fastBPE and learn BPE vocabulary (with 30,000 codes here):

OUTPATH=data/processed/XLM_en/30k  # path where processed files will be stored
FASTBPE=tools/fastBPE/fast  # path to the fastBPE tool

# create output path
mkdir -p $OUTPATH

# learn bpe codes on the training set (or only use a subset of it)
$FASTBPE learnbpe 30000 data/wiki/txt/en.train > $OUTPATH/codes
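For readers unfamiliar with BPE, the toy sketch below shows the idea behind learning merge codes: repeatedly find the most frequent pair of adjacent symbols and fuse it into a new symbol. It is a simplified illustration, not fastBPE's implementation. The `learn_bpe` function is hypothetical and omits details such as word-boundary markers.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs across the corpus.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


merges = learn_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
```

With 30,000 merges on a real corpus, frequent words end up as single symbols while rare words are split into smaller subword units.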

Now apply BPE tokenization to train/valid/test files:

$FASTBPE applybpe $OUTPATH/train.en data/wiki/txt/en.train $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/valid.en data/wiki/txt/en.valid $OUTPATH/codes &
$FASTBPE applybpe $OUTPATH/test.en data/wiki/txt/en.test $OUTPATH/codes &
wait  # block until the three background jobs have finished writing their output

and get the post-BPE vocabulary:

cat $OUTPATH/train.en | $FASTBPE getvocab - > $OUTPATH/vocab

Binarize the data to limit the size of the data we load in memory:

# This will create three files: $OUTPATH/{train,valid,test}.en.pth
# After that we're all set
python preprocess.py $OUTPATH/vocab $OUTPATH/train.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/valid.en &
python preprocess.py $OUTPATH/vocab $OUTPATH/test.en &
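Conceptually, binarizing means replacing each token string with an integer id from the vocabulary and storing the result in a compact array. The sketch below illustrates that idea only. The `binarize` function, the `<unk>` token name, and the returned layout are assumptions for this example and do not describe the actual format written by preprocess.py.

```python
from array import array


def binarize(lines, vocab, unk_token="<unk>"):
    """Map whitespace-tokenized lines to one flat array of integer ids."""
    word2id = {word: idx for idx, word in enumerate(vocab)}
    unk_id = word2id[unk_token]
    ids = array("i")   # compact typed array of C ints instead of Python strings
    boundaries = []    # (start, end) offsets of each sentence inside `ids`
    for line in lines:
        start = len(ids)
        # Tokens missing from the vocabulary fall back to the unknown id.
        ids.extend(word2id.get(tok, unk_id) for tok in line.split())
        boundaries.append((start, len(ids)))
    return ids, boundaries


vocab = ["<unk>", "the", "cat", "sat"]
ids, boundaries = binarize(["the cat sat", "the dog sat"], vocab)
print(list(ids), boundaries)
```

Storing ids rather than strings is what keeps the memory footprint small when the training data is loaded.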

2. Train the BERT model

Train your BERT model (without the next-sentence prediction task) on the preprocessed data:


python train.py

## main parameters
--exp_name xlm_en                          # experiment name
--dump_path ./dumped                       # where to store the experiment

## data location / training objective
--data_path $OUTPATH                       # data location
--lgs 'en'                                 # considered languages
--clm_steps ''                             # CLM objective (for training GPT-2 models)
--mlm_steps 'en'                           # MLM objective

## transformer parameters
--emb_dim 2048                             # embeddings / model dimension (2048 is large; reduce it if you only have 16GB of GPU memory)
--n_layers 12                              # number of layers
--n_heads 16                               # number of heads
--dropout 0.1                              # dropout
--attention_dropout 0.1                    # attention dropout
--gelu_activation true                     # GELU instead of ReLU

## optimization
--batch_size 32                            # sequences per batch
--bptt 256                                 # sequence length (streams of 256 tokens)
--optimizer adam_inverse_sqrt,lr=0.00010,warmup_updates=30000,beta1=0.9,beta2=0.999,weight_decay=0.01,eps=0.000001  # optimizer (training is quite sensitive to this parameter)
--epoch_size 300000                        # number of sentences per epoch
--max_epoch 100000                         # max number of epochs (~infinite here)
--validation_metrics _valid_en_mlm_ppl     # validation metric (when to save the best model)
--stopping_criterion _valid_en_mlm_ppl,25  # stopping criterion (if criterion does not improve 25 times)
--fp16 true                                # use fp16 training

## bert parameters
--word_mask_keep_rand '0.8,0.1,0.1'        # bert masking probabilities
--word_pred '0.15'                         # predict 15 percent of the words

## There are other parameters that are not specified here (see train.py).
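The two BERT parameters above can be read as follows: `--word_pred 0.15` selects about 15% of the positions for prediction, and `--word_mask_keep_rand '0.8,0.1,0.1'` decides what happens to each selected token (masked, kept, or replaced by a random token). The sketch below is a minimal illustration of that scheme in plain Python. The `mask_tokens` function and the `<mask>` symbol are hypothetical, and the real implementation in train.py operates on batched tensors and may differ in detail.

```python
import random


def mask_tokens(tokens, vocab, word_pred=0.15, mask_keep_rand=(0.8, 0.1, 0.1),
                mask_token="<mask>", rng=random):
    """Return (corrupted tokens, targets); targets[i] is None where nothing is predicted."""
    p_mask, p_keep, _ = mask_keep_rand
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() >= word_pred:
            # Position not selected: left untouched and excluded from the loss.
            corrupted.append(tok)
            targets.append(None)
            continue
        targets.append(tok)  # the model must recover the original token
        r = rng.random()
        if r < p_mask:
            corrupted.append(mask_token)         # replace with the mask symbol
        elif r < p_mask + p_keep:
            corrupted.append(tok)                # keep the original token
        else:
            corrupted.append(rng.choice(vocab))  # replace with a random token
    return corrupted, targets


rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog".split()
corrupted, targets = mask_tokens(tokens, vocab=tokens, rng=rng)
print(corrupted)
print(targets)
```

Keeping or randomizing some of the selected tokens means the model cannot rely on the mask symbol alone to know which positions it will be asked to predict.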

To train with multiple GPUs use:

export NGPU=8; python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

Tips: Even when the validation perplexity plateaus, keep training your model. The larger the batch size the better (so using multiple GPUs will improve performance). Tuning the learning rate (e.g. [0.0001, 0.0002]) should help.
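When tuning the learning rate, it can help to see what an inverse square root schedule with warmup does to it over time. The sketch below uses the `lr` and `warmup_updates` values from the optimizer string above. The starting value `warmup_init_lr` and the exact formula are assumptions for illustration, so check the optimizer code in the repository for the schedule actually used.

```python
def inverse_sqrt_lr(step, lr=0.0001, warmup_updates=30000, warmup_init_lr=1e-7):
    """Learning rate at update `step` under an inverse square root schedule."""
    if step < warmup_updates:
        # Linear warmup from warmup_init_lr (assumed value) up to the peak lr.
        return warmup_init_lr + step * (lr - warmup_init_lr) / warmup_updates
    # After warmup, decay proportionally to 1 / sqrt(step).
    return lr * (warmup_updates / step) ** 0.5


for step in (1, 15000, 30000, 120000):
    print(step, f"{inverse_sqrt_lr(step):.2e}")
```

In this sketch the rate peaks at `lr` after 30,000 updates and is halved each time the number of updates quadruples, so a larger peak `lr` shifts the whole curve up.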

3. Fine-tune a pretrained model on GLUE tasks

Now that the model is pretrained, let's fine-tune it. First, download and preprocess the GLUE tasks:

# Download and tokenize GLUE tasks in 'data/glue/{MNLI,QNLI,SST-2,STS-B}'

./get-data-glue.sh

# Preprocessing should be the same as for training.
# If you removed lower-casing/accent-removal, it should be reflected here as well.

and prepare the GLUE data using the codes and vocab:

# by default this script uses the BPE codes and vocab of pretrained XLM_en. Modify in script if needed.
./prepare-glue.sh

In addition to the train.py script, we provide a complementary script glue-xnli.py to fine-tune a model on either GLUE or XNLI.

You can now fine-tune the pretrained model on one of the English GLUE tasks using this config:

# Config used for fine-tuning our pretrained English BERT model (mlm_en_2048.pth)
python glue-xnli.py
--exp_name test_xlm_en_glue              # experiment name
--dump_path ./dumped                     # where to store the experiment
--model_path mlm_en_2048.pth             # model location
--data_path $OUTPATH                     # data location
--transfer_tasks MNLI-m,QNLI,SST-2       # transfer tasks (GLUE tasks)
--optimizer_e adam,lr=0.000025           # optimizer of the embedder, i.e. the pretrained model (lr \in [0.000005, 0.000025, 0.000125])
--optimizer_p adam,lr=0.000025           # optimizer of projection (lr \in [0.000005, 0.000025, 0.000125])
--finetune_layers "0:_1"                 # fine-tune all layers
--batch_size 8                           # batch size (\in [4, 8])
--n_epochs 250                           # number of epochs
--epoch_size 20000                       # number of sentences per epoch (relatively sma
