TextBrewer
A PyTorch-based knowledge distillation toolkit for natural language processing
TextBrewer is a PyTorch-based model distillation toolkit for natural language processing. It includes various distillation techniques from both the NLP and CV fields and provides an easy-to-use distillation framework that allows users to quickly experiment with state-of-the-art distillation methods, compressing models with a relatively small sacrifice in performance while increasing inference speed and reducing memory usage.
Check our paper through ACL Anthology or arXiv pre-print.
News
Dec 17, 2021
- We have released a model pruning toolkit TextPruner. Check https://github.com/airaria/TextPruner
Oct 24, 2021
- We propose the first pre-trained language model that specifically focuses on Chinese minority languages. Check: https://github.com/ymcui/Chinese-Minority-PLM
Jul 8, 2021
- New examples with Transformers 4
- The current examples (examples/) were written with old versions of Transformers and may cause confusion and bugs. We have rewritten the examples with Transformers 4 in Jupyter notebooks, which are easy to follow and learn from.
- The new examples can be found at examples/notebook_examples. See Examples for details.
Mar 1, 2021
- BERT-EMD and custom distiller
  - We added an experiment with BERT-EMD to the MNLI example. BERT-EMD allows each intermediate student layer to learn from any intermediate teacher layer adaptively, based on optimizing the Earth Mover's Distance, so there is no need to specify a matching scheme.
  - We have written a new EMDDistiller to perform BERT-EMD. It demonstrates how to write a custom distiller.
- Updated MNLI example
  - We removed pretrained_pytorch_bert and use the transformers library instead in all the MNLI examples.
Nov 11, 2020
- Updated to 0.2.1:
  - More flexible distillation: supports feeding different batches to the student and teacher, so the batches for the student and the teacher no longer need to be the same. This can be used to distill between models with different vocabularies (e.g., from RoBERTa to BERT).
  - Faster distillation: users can now pre-compute and cache the teacher outputs, then feed the cache to the distiller to save the teacher's forward-pass time.
  - See Feed Different Batches to Student and Teacher and Feed Cached Values for details on the above features.
  - MultiTaskDistiller now supports the intermediate feature matching loss.
  - Tensorboard now records more detailed losses (KD loss, hard-label loss, matching losses, ...).
  - See details in releases.
August 27, 2020
- We are happy to announce that our model is on top of the GLUE benchmark; check the leaderboard.
Aug 24, 2020
- Updated to 0.2.0.1:
  - Fixed bugs in MultiTaskDistiller and training loops.
Jul 29, 2020
- Updated to 0.2.0:
  - Added support for distributed data-parallel training with DistributedDataParallel: TrainingConfig now accepts the local_rank argument. See the documentation of TrainingConfig for details.
  - Added an example of distillation on the Chinese NER task to demonstrate distributed data-parallel training. See examples/msra_ner_example.
Jul 14, 2020
- Updated to 0.1.10:
  - Now supports mixed precision training with Apex! Just set fp16 to True in TrainingConfig. See the documentation of TrainingConfig for details.
  - Added the data_parallel option in TrainingConfig so that data-parallel training and mixed precision training can work together.
Apr 26, 2020
- Added Chinese NER task (MSRA NER) results.
- Added results for distilling to the T12-nano model, which has a structure similar to Electra-small.
- Updated some results of CoNLL-2003, CMRC 2018 and DRCD.
Apr 22, 2020
- Updated to 0.1.9 (added cache option which speeds up distillation; fixed some bugs). See details in releases.
- Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.
- TextBrewer has been accepted by ACL 2020 as a demo paper, please use our new bib entry.
Mar 17, 2020
- Added CoNLL-2003 English NER distillation example. See examples/conll2003_example.
Mar 11, 2020
- Updated to 0.1.8 (Improvements on TrainingConfig and train method). See details in releases.
Mar 2, 2020
- Initial public version 0.1.7 has been released. See details in releases.
Table of Contents
<!-- TOC -->

| Section | Contents |
|---------|----------|
| Introduction | Introduction to TextBrewer |
| Installation | How to install |
| Workflow | Two stages of the TextBrewer workflow |
| Quickstart | Example: distilling BERT-base to a 3-layer BERT |
| Experiments | Distillation experiments on typical English and Chinese datasets |
| Core Concepts | Brief explanations of the core concepts in TextBrewer |
| FAQ | Frequently asked questions |
| Known Issues | Known issues |
| Citation | Citation to TextBrewer |
| Follow Us | - |

<!-- /TOC -->

Introduction

TextBrewer is designed for the knowledge distillation of NLP models. It provides various distillation methods and a distillation framework for quickly setting up experiments.
The main features of TextBrewer are:
- Wide support: it supports various model architectures (especially transformer-based models)
- Flexibility: design your own distillation scheme by combining different techniques; it also supports user-defined loss functions, modules, etc.
- Easy-to-use: users don't need to modify the model architectures
- Built for NLP: it is suitable for a wide variety of NLP tasks: text classification, machine reading comprehension, sequence labeling, ...
TextBrewer currently ships with the following distillation techniques:
- Mixed soft-label and hard-label training
- Dynamic loss weight adjustment and temperature adjustment
- Various distillation loss functions: hidden states MSE, attention-matrix-based loss, neuron selectivity transfer, ...
- Freely adding intermediate feature matching losses
- Multi-teacher distillation
- ...
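To make the first two techniques above concrete, here is a minimal, dependency-free sketch of a mixed soft-label/hard-label loss with a distillation temperature. This is an illustration in plain Python, not TextBrewer's actual implementation; the function names, the temperature `T=4.0`, and the weight `alpha=0.9` are assumptions chosen for the example.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, gold_label, T=4.0, alpha=0.9):
    """Mixed soft-label + hard-label loss (a sketch, not TextBrewer's code).

    Soft part: cross-entropy between the temperature-softened teacher and
    student distributions, scaled by T*T as in Hinton et al.'s formulation.
    Hard part: ordinary cross-entropy against the gold label.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s)) * T * T
    hard = -math.log(softmax(student_logits)[gold_label])
    return alpha * soft + (1 - alpha) * hard

# A student that agrees with the teacher (and the gold label) is penalized less
# than one that disagrees.
loss = kd_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], gold_label=0)
```

Dynamic loss weight and temperature adjustment then amount to making `alpha` and `T` functions of the training step rather than constants.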
TextBrewer includes:
- Distillers: the cores of distillation. Different distillers perform different distillation modes. There are GeneralDistiller, MultiTeacherDistiller, BasicTrainer, etc.
- Configurations and presets: Configuration classes for training and distillation, and predefined distillation loss functions and strategies.
- Utilities: auxiliary tools such as model parameters analysis.
To start distillation, users need to provide
- the models (the trained teacher model and the untrained student model)
- datasets and experiment configurations
TextBrewer has achieved impressive results on several typical NLP tasks. See Experiments.
See Full Documentation for detailed usages.
Architecture

Installation
- Requirements
  - Python >= 3.6
  - PyTorch >= 1.1.0
  - TensorboardX or Tensorboard
  - NumPy
  - tqdm
  - Transformers >= 2.0 (optional, used by some examples)
  - Apex == 0.1.0 (optional, for mixed precision training)
- Install from PyPI

  ```bash
  pip install textbrewer
  ```

- Install from the GitHub source

  ```bash
  git clone https://github.com/airaria/TextBrewer.git
  pip install ./textbrewer
  ```
Workflow


- Stage 1: Preparation
  - Train the teacher model
  - Define and initialize the student model
  - Construct a dataloader, an optimizer, and a learning rate scheduler
- Stage 2: Distillation with TextBrewer
  - Construct a TrainingConfig and a DistillationConfig, and initialize a distiller
  - Define an adaptor and a callback. The adaptor adapts the model inputs and outputs for the distiller; the callback is called by the distiller during training
  - Call the train method of the distiller
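The Stage 2 pieces fit together roughly as follows. This is a schematic, dependency-free sketch of the control flow only: the names mirror TextBrewer's concepts (adaptor, callback, train), but every function here is a toy stand-in, not the library's API. The distiller iterates over the dataloader, runs the teacher and student, passes their raw outputs through the adaptors to get a uniform view, computes a loss, and invokes the callback.

```python
# Toy stand-ins for the workflow: NOT the textbrewer API, just the control flow.

def teacher_model(batch):
    # Pretend forward pass of the trained teacher.
    return {"logits": [x * 2.0 for x in batch]}

def student_model(batch):
    # Pretend forward pass of the (untrained) student.
    return {"logits": [x * 1.5 for x in batch]}

def adaptor(batch, model_outputs):
    # Maps raw model inputs/outputs to the fields the distiller understands.
    return {"logits": model_outputs["logits"]}

def callback(step, loss):
    # Called by the distiller during training, e.g. for logging or evaluation.
    print(f"step {step}: loss={loss:.4f}")

def train(dataloader, num_steps):
    """Schematic distillation loop: forward both models, adapt, compute loss, callback."""
    losses = []
    for step, batch in enumerate(dataloader):
        if step >= num_steps:
            break
        out_t = adaptor(batch, teacher_model(batch))
        out_s = adaptor(batch, student_model(batch))
        # Placeholder KD loss: MSE between teacher and student logits.
        loss = sum((t - s) ** 2 for t, s in zip(out_t["logits"], out_s["logits"]))
        # (A real distiller would backpropagate and step the optimizer here.)
        callback(step, loss)
        losses.append(loss)
    return losses

losses = train(dataloader=[[1.0, 2.0], [0.5, -0.5]], num_steps=2)
```

In TextBrewer itself, the dataloader, optimizer, and scheduler from Stage 1 are passed to the distiller's train method, and the configs decide which losses and matching schemes the loop applies.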
Quickstart
Here we show the usage of TextBrewer by distilling BERT-base to a 3-layer BERT.
Before distillation