
Structure-Level Knowledge Distillation for Multilingual NLP

This is the code for our ACL 2020 paper: Structure-Level Knowledge Distillation For Multilingual Sequence Labeling. It is a framework for training unified multilingual models with knowledge distillation, based mainly on flair version 0.4.3 with a lot of modifications. This repo includes the following features:

| Task | Monolingual | Multilingual | Finetuning | Knowledge Distillation | Notes |
|------|-------------|--------------|------------|------------------------|-------|
| Sequence Labeling | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Structure-level knowledge distillation (Wang et al., 2020) |
| Dependency Parsing | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :x: | State-of-the-Art Parser for Enhanced Universal Dependencies in the IWPT 2020 shared task (Wang et al., 2020) and State-of-the-Art Parser for Semantic Dependency Parsing (Wang et al., 2019) |


Training Sequence Labelers

Requirements and Installation

The project is based on PyTorch 1.1+ and Python 3.6+.

pip install -r requirements.txt

Teacher Models

Let's train a multilingual CoNLL named entity recognition (NER) model as an example. First, we need to prepare the teacher models: download the pretrained teacher models from Google Drive and put them in resources/taggers.

Alternatively, you can train the teacher models yourself:

python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_de_monolingual_crf_sentloss_10patience_baseline_nodev_ner0.yaml
python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_en_monolingual_crf_sentloss_10patience_baseline_nodev_ner0.yaml
python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_es_monolingual_crf_sentloss_10patience_baseline_nodev_ner1.yaml
python train_with_teacher.py --config config/multi_bert_origflair_300epoch_2000batch_0.1lr_256hidden_nl_monolingual_crf_sentloss_10patience_baseline_nodev_ner1.yaml

Training the Multilingual Model without M-BERT finetuning

Knowledge Distillation

After all teacher models are ready, we can train the unified multilingual model with one of the following distillation approaches.

Posterior distillation

To reproduce the accuracy reported in our paper, run:

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_2.25temperature_old_relearn_nodev_fast_new_ner0.yaml

We also find that a larger temperature can lead to better results:

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_posterior_4temperature_old_relearn_nodev_fast_new_ner0.yaml
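Conceptually, posterior distillation matches the student's token-level posterior distributions to the teacher's, softened by the temperature. Below is a minimal sketch of such a loss; it is illustrative only (in the paper the posteriors are CRF marginals computed with forward-backward, while a plain softmax over logits stands in here):

```python
import torch.nn.functional as F

def posterior_distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """Temperature-scaled KL between teacher and student posteriors.

    Both tensors have shape [batch, seq_len, num_labels].
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)          # soft targets
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)
```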

Top-K distillation

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_1best_old_relearn_nodev_fast_new_ner0.yaml

Top-WK distillation

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_crfatt_old_relearn_nodev_fast_new_ner0.yaml

Posterior+Top-WK distillation

python train_with_teacher.py --config config/multi_bert_300epoch_0.5anneal_2000batch_0.1lr_600hidden_multilingual_crf_sentloss_10patience_distill_fast_crfatt_posterior_4temperature_both_old_relearn_nodev_fast_new_ner1.yaml
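Conceptually, Top-K distillation uses the teacher's k-best label sequences as pseudo targets, and Top-WK additionally weights each sequence by its renormalized probability under the teacher. Below is a minimal sketch of the weighted variant (Top-K is the uniform-weight special case); `student_crf_nll` is a hypothetical stand-in for the student CRF's negative log-likelihood, not an API of this repo:

```python
import torch

def topwk_distillation_loss(student_crf_nll, sentence, kbest_sequences, kbest_scores):
    # Renormalize the teacher's k-best path scores into weights.
    weights = torch.softmax(kbest_scores, dim=0)
    # Weighted sum of the student's NLL over each teacher-proposed sequence.
    return sum(w * student_crf_nll(sentence, seq)
               for seq, w in zip(kbest_sequences, weights))
```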

Training the Multilingual Model with M-BERT finetuning

Finetuning M-BERT without the CRF layer

Following the examples in transformers, we use a learning rate of 5e-5 for M-BERT finetuning. Run:

python train_with_teacher.py --config config/multi_bert_10epoch_2000batch_0.00005lr_multilingual_nocrf_sentloss_baseline_fast_finetune_relearn_nodev_ner0.yaml

Finetuning M-BERT with the CRF layer

The key to finetuning M-BERT with the CRF layer is setting a larger learning rate (0.5 here) for the transition table while keeping a small learning rate for the M-BERT layers. To train the model, run:

python train_with_teacher.py --config config/multi_bert_10epoch_2000batch_0.00005lr_10000lrrate_5decay_800hidden_multilingual_crf_sentloss_baseline_fast_finetune_relearn_nodev_ner0.yaml
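The sketch below illustrates this two-learning-rate setup with PyTorch parameter groups; the module names and the choice of AdamW are illustrative, not the repo's exact code (the config name encodes lr 0.00005 with a 10000x rate for the transitions, i.e. 0.5):

```python
import torch
from torch import nn

# Placeholders for the real model parts (illustrative only).
bert_encoder = nn.Linear(768, 768)                 # stands in for the M-BERT encoder
crf_transitions = nn.Parameter(torch.randn(9, 9))  # stands in for the CRF transition table

optimizer = torch.optim.AdamW([
    {"params": bert_encoder.parameters(), "lr": 5e-5},  # small lr for M-BERT
    {"params": [crf_transitions], "lr": 5e-5 * 10000},  # lr = 0.5 for the transitions
])
```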

Posterior distillation

To distill the posterior distributions while finetuning the M-BERT model, run:

python train_with_teacher.py --config config/multi_bert_10epoch_10anneal_2000batch_0.00005lr_10000lrrate_5decay_800hidden_multilingual_crf_sentloss_distill_posterior_4temperature_fast_finetune_relearn_nodev_ner1.yaml

Performance

Performance on CoNLL-02/03 NER with M-BERT finetuning (averaged over 3 runs):

| Finetune | CRF | Knowledge Distillation | English | Dutch | Spanish | German | Average |
|----------|-----|------------------------|---------|-------|---------|--------|---------|
| :heavy_check_mark: | :x: | :x: | 91.09 | 90.34 | 87.88 | 82.59 | 87.97 |
| :heavy_check_mark: | :heavy_check_mark: | :x: | 91.47 | 90.97 | 88.15 | 82.80 | 88.35 |
| :heavy_check_mark: | :heavy_check_mark: | Posterior | 91.63 | 91.38 | 88.78 | 83.21 | 88.75 |


Training Dependency Parsers

The dependency parsing module is based on the code of parser. Our parser is also able to perform semantic dependency parsing (Oepen et al., 2014) with second-order mean-field variational inference (Wang et al., 2019).
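For reference, a simplified form of the mean-field update (after Wang et al., 2019; notation simplified here, with sibling, co-parent, and grandparent second-order parts):

```latex
% Simplified sketch: Q_{ij} is the posterior of arc (i,j), s^{arc} the
% first-order score, and s^{sib}, s^{cop}, s^{gp} second-order part scores.
Q^{(t)}_{ij} = \sigma\Big( s^{\mathrm{arc}}_{ij}
    + \sum_{k} Q^{(t-1)}_{ik}\, s^{\mathrm{sib}}_{i,j,k}
    + \sum_{k} Q^{(t-1)}_{kj}\, s^{\mathrm{cop}}_{i,j,k}
    + \sum_{k} Q^{(t-1)}_{jk}\, s^{\mathrm{gp}}_{i,j,k} \Big)
```

See Wang et al. (2019) for the exact formulation.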

Multilingual Syntactic Dependency Parsing

For multilingual syntactic dependency parsing, we use Universal Dependencies as an example:

python train_with_teacher.py --config config/multi_bert_1000epoch_0.5inter_3000batch_0.002lr_400hidden_multilingual_nocrf_fast_nodev_dependency0.yaml

Training the model with BERT finetuning:

python train_with_teacher.py --config config/multi_bert_10epoch_0.5inter_3000batch_0.00005lr_20lrrate_multilingual_nocrf_fast_warmup_freezing_beta_weightdecay_finetune_nodev_dependency15.yaml

Note: the performance of monolingual models has not been evaluated yet. If you want to train a monolingual model, please try the configurations of parser.

Enhanced Universal Dependency (EUD) Parsing

To reproduce our results on EUD parsing, we provide the conversion scripts for the official dataset, as well as our processed training/development/test sets for the task. To train the model (here we take the Tamil dataset as an example; please refer to config for the config files of other languages), run:

python train_with_teacher.py --config config/xlmr_word_origflair_1000epoch_0.1inter_2000batch_0.002lr_400hidden_ta_monolingual_nocrf_fast_2nd_unrel_250upsample_nodev_enhancedud27.yaml

As described in the paper, we use labeled F1 scores (originating from semantic dependency parsing) rather than ELAS for EUD training. Therefore, if you want to evaluate the ELAS score, first parse the graphs:

python train_with_teacher.py --config config/xlmr_word_origflair_1000epoch_0.1inter_2000batch_0.002lr_400hidden_ta_monolingual_nocrf_fast_2nd_unrel_250upsample_nodev_enhancedud27.yaml --parse --target_dir iwpt2020_test/ta --keep_order --batch_size 1000

Then evaluate the result with the official script. (Note that the official evaluation script does not check connectivity; if you go through the strict official submission process, please fix the remaining validation issues manually. For ELAS, however, connectivity does not affect the result much.)
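For example, assuming the official iwpt20_xud_eval.py evaluator from the IWPT 2020 shared task (the file paths below are placeholders):

python iwpt20_xud_eval.py -v gold.conllu pred.conllu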

Semantic Dependency Parsing (SDP)

The code for EUD parsing is also applicable to SDP. We provide a PyTorch version of our second-order SDP parser here (for the TensorFlow version, see here). However, we have not evaluated the performance on SDP datasets yet. You may need to modify some code and hyper-parameters to run on SDP datasets.

Others

Write Your Own Config File

We provide a detailed description of our config file in config.

GPU Memory

We have updated the code for better GPU utilization; training a multilingual sequence labeler with knowledge distillation now needs only 8~9 GB of GPU memory rather than the 14~15 GB reported in the paper.

Faster Speed

We modified the code of flair for significantly faster training. For example, we replaced the CharacterEmbeddings class in embeddings.py with FastCharacterEmbeddings for significantly faster character embedding, and WordEmbeddings is replaced by FastWordEmbeddings so that the word embeddings can be updated during training. For training sequence labelers with word and character embeddings, our code is more than 1.5 times faster than the original version.
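The trainable word-embedding change can be pictured as copying pretrained vectors into an unfrozen `nn.Embedding` so that they receive gradients (a sketch of the idea only; the actual FastWordEmbeddings internals may differ):

```python
import torch
from torch import nn

pretrained = torch.randn(10000, 300)  # placeholder for loaded pretrained word vectors
embedding = nn.Embedding.from_pretrained(pretrained, freeze=False)  # now updated during training
```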


Citing Us

For Sequence Labelers

Please cite the following paper when training the multilingual sequence labeling models:

@inproceedings{wang-etal-2020-structure,
    title = "Structure-Level Knowledge Distillation For Multilingual Sequence Labeling",
    author = "Wang, Xinyu and Jiang, Yong and Bach, Nguyen and Wang, Tao and Huang, Fei and Tu, Kewei",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    year = "2020",
}