SelectiveMasking

Source code for "Train No Evil: Selective Masking for Task-Guided Pre-Training"

Download Data

The datasets can be downloaded from this link. Put them in data/datasets.

Run the Whole Pipeline

Modify config/test.json to set the input path, output path, BERT model path, GPU usage, etc.
Run bash scripts/run_all_pipeline.sh
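
Because the pipeline is long-running, it can help to check the path settings before launching it. The following is a minimal sketch, not part of this repository: load_config and missing_paths are hypothetical helpers, and the key names you pass in are whatever your own config/test.json uses (its schema is not described here).

```python
import json
import os


def load_config(path):
    """Read a JSON config file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def missing_paths(config, path_keys):
    """Return the keys in path_keys whose value is absent, not a string,
    or not an existing file or directory."""
    missing = []
    for key in path_keys:
        value = config.get(key)
        if not isinstance(value, str) or not os.path.exists(value):
            missing.append(key)
    return missing
```

For example, missing_paths(load_config("config/test.json"), keys) returns the subset of keys that still need attention, where keys is the list of path-valued setting names in your config. Output paths that the pipeline creates itself should be left out of that list.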

Run each step

The meaning of each step can be found in the appendix of our paper. The input/output paths are also set in config/test.json. Run python3 convert_config.py config/test.json to convert the .json file to a .sh file.
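
To illustrate the idea of turning a .json config into a .sh file, here is a minimal sketch of one way such a conversion could work. It is an assumption for illustration only and does not reproduce convert_config.py: the flatten and to_shell helpers, the upper-case variable naming, and the example keys are all hypothetical.

```python
import json
import re
import shlex


def flatten(config, prefix=""):
    """Yield (NAME, value) pairs, joining nested keys with underscores."""
    for key, value in config.items():
        # Replace characters that are not valid in shell variable names.
        name = re.sub(r"\W", "_", f"{prefix}{key}").upper()
        if isinstance(value, dict):
            yield from flatten(value, prefix=f"{name}_")
        else:
            yield name, value


def to_shell(config):
    """Render a config dict as one quoted shell assignment per line."""
    lines = [f"{name}={shlex.quote(str(value))}" for name, value in flatten(config)]
    return "\n".join(lines) + "\n"


# Hypothetical keys, used only to show the output format.
example = json.loads('{"bert_model_path": "/tmp/my bert", "gpu": {"num": 4}}')
print(to_shell(example), end="")
```

Quoting each value with shlex.quote keeps paths that contain spaces intact when the generated file is sourced by a shell script.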

1 GenePT

We use the training scripts from https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT for general pre-training.

2 Selective Masking

2.1 Finetune BERT

bash scripts/finetune_origin.sh

2.2 Downstream Mask

bash data/create_data_rule/run.sh

2.3 Train NN

bash scripts/run_mask_model.sh

2.4 In-domain Mask

bash data/create_data_model/run.sh

3 TaskPT

bash scripts/run_pretraining.sh

4 Fine-tune

bash scripts/finetune_ckpt_all_seed.sh
python3 gather_results.py $PATH_TO_THE_FINETUNE_OUTPUT
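
The per-step commands above can also be driven from a small script, which makes it easy to resume from a later step after a failure. This is a minimal sketch under stated assumptions: run_steps and STEPS are hypothetical names, the commands are copied from the sections above, and it is not a replacement for scripts/run_all_pipeline.sh, whose contents are not shown here. Step 1 (GenePT) is omitted because it uses the external NVIDIA scripts, and gather_results.py is omitted because it needs your fine-tune output path as an argument.

```python
import subprocess

# Per-step commands copied from the sections above, in order.
STEPS = [
    ("2.1 Finetune BERT", ["bash", "scripts/finetune_origin.sh"]),
    ("2.2 Downstream Mask", ["bash", "data/create_data_rule/run.sh"]),
    ("2.3 Train NN", ["bash", "scripts/run_mask_model.sh"]),
    ("2.4 In-domain Mask", ["bash", "data/create_data_model/run.sh"]),
    ("3 TaskPT", ["bash", "scripts/run_pretraining.sh"]),
    ("4 Fine-tune", ["bash", "scripts/finetune_ckpt_all_seed.sh"]),
]


def run_steps(steps, start=0, dry_run=True):
    """Run steps[start:] in order, stopping at the first failure.

    Returns the names of the steps that were run, or that would be run
    when dry_run is set.
    """
    done = []
    for name, cmd in steps[start:]:
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
        done.append(name)
    return done
```

Calling run_steps(STEPS) only prints the commands. Pass dry_run=False from the repository root to execute them, and use start to skip steps that have already finished.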

Cite

If you use the code, please cite this paper:

@inproceedings{gu2020train,
    title={Train No Evil: Selective Masking for Task-Guided Pre-Training},
    author={Yuxian Gu and Zhengyan Zhang and Xiaozhi Wang and Zhiyuan Liu and Maosong Sun},
    year={2020},
    booktitle={Proceedings of EMNLP 2020},
}
