TextBrewer
A PyTorch-based knowledge distillation toolkit for natural language processing
TextBrewer is a PyTorch-based model distillation toolkit for natural language processing. It includes various distillation techniques from both the NLP and CV fields and provides an easy-to-use distillation framework that allows users to quickly experiment with state-of-the-art distillation methods, compressing models with a relatively small sacrifice in performance while increasing inference speed and reducing memory usage.
Check our paper through ACL Anthology or arXiv pre-print.
News
Dec 17, 2021
- We have released a model pruning toolkit TextPruner. Check https://github.com/airaria/TextPruner
Oct 24, 2021
- We propose the first pre-trained language model that specifically focuses on Chinese minority languages. Check: https://github.com/ymcui/Chinese-Minority-PLM
Jul 8, 2021
- New examples with Transformers 4
- The current examples (examples/) were written with old versions of Transformers and may cause confusion and bugs. We have rewritten the examples with Transformers 4 in Jupyter notebooks, which are easy to follow and learn from.
- The new examples can be found at examples/notebook_examples. See Examples for details.
Mar 1, 2021
- BERT-EMD and custom distiller
  - We added an experiment with BERT-EMD to the MNLI example. BERT-EMD allows each intermediate student layer to learn from any intermediate teacher layer adaptively, based on optimizing the Earth Mover's Distance, so there is no need to specify a matching scheme.
  - We have written a new EMDDistiller to perform BERT-EMD. It demonstrates how to write a custom distiller.
- Updated MNLI example
  - We removed pretrained_pytorch_bert and use the transformers library instead in all the MNLI examples.
Nov 11, 2020
- Updated to 0.2.1:
  - More flexible distillation: supports feeding different batches to the student and teacher, so the batches for the student and the teacher no longer need to be the same. This can be used to distill between models with different vocabularies (e.g., from RoBERTa to BERT).
  - Faster distillation: users can now pre-compute and cache the teacher outputs, then feed the cache to the distiller to save the teacher's forward-pass time.
  - See Feed Different Batches to Student and Teacher and Feed Cached Values for details on the above features.
  - MultiTaskDistiller now supports the intermediate feature matching loss.
  - Tensorboard now records more detailed losses (KD loss, hard-label loss, matching losses, ...).
  - See details in releases.
August 27, 2020
- We are happy to announce that our model is on top of the GLUE benchmark; check the leaderboard.
Aug 24, 2020
- Updated to 0.2.0.1:
  - Fixed bugs in MultiTaskDistiller and training loops.
Jul 29, 2020
- Updated to 0.2.0:
  - Added support for distributed data-parallel training with DistributedDataParallel: TrainingConfig now accepts the local_rank argument. See the documentation of TrainingConfig for details.
  - Added an example of distillation on the Chinese NER task to demonstrate distributed data-parallel training. See examples/msra_ner_example.
Jul 14, 2020
- Updated to 0.1.10:
  - Now supports mixed precision training with Apex! Just set fp16 to True in TrainingConfig. See the documentation of TrainingConfig for details.
  - Added the data_parallel option in TrainingConfig so that data-parallel training and mixed precision training can work together.
Apr 26, 2020
- Added Chinese NER task (MSRA NER) results.
- Added results for distilling to the T12-nano model, which has a structure similar to Electra-small.
- Updated some results of CoNLL-2003, CMRC 2018 and DRCD.
Apr 22, 2020
- Updated to 0.1.9 (added cache option which speeds up distillation; fixed some bugs). See details in releases.
- Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.
- TextBrewer has been accepted by ACL 2020 as a demo paper, please use our new bib entry.
Mar 17, 2020
- Added CoNLL-2003 English NER distillation example. See examples/conll2003_example.
Mar 11, 2020
- Updated to 0.1.8 (Improvements on TrainingConfig and train method). See details in releases.
Mar 2, 2020
- Initial public version 0.1.7 has been released. See details in releases.
Table of Contents
<!-- TOC -->

| Section | Contents |
|---------|----------|
| Introduction | Introduction to TextBrewer |
| Installation | How to install |
| Workflow | Two stages of the TextBrewer workflow |
| Quickstart | Example: distilling BERT-base to a 3-layer BERT |
| Experiments | Distillation experiments on typical English and Chinese datasets |
| Core Concepts | Brief explanations of the core concepts in TextBrewer |
| FAQ | Frequently asked questions |
| Known Issues | Known issues |
| Citation | Citation to TextBrewer |
| Follow Us | - |

<!-- /TOC -->

Introduction

TextBrewer is designed for the knowledge distillation of NLP models. It provides various distillation methods and a distillation framework for quickly setting up experiments.
The main features of TextBrewer are:
- Wide support: it supports various model architectures (especially transformer-based models)
- Flexibility: design your own distillation scheme by combining different techniques; it also supports user-defined loss functions, modules, etc.
- Easy-to-use: users don't need to modify the model architectures
- Built for NLP: it is suitable for a wide variety of NLP tasks: text classification, machine reading comprehension, sequence labeling, ...
TextBrewer currently ships with the following distillation techniques:
- Mixed soft-label and hard-label training
- Dynamic loss weight adjustment and temperature adjustment
- Various distillation loss functions: hidden states MSE, attention-matrix-based loss, neuron selectivity transfer, ...
- Freely adding intermediate feature matching losses
- Multi-teacher distillation
- ...
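To make the first two techniques above concrete, here is a minimal, dependency-free sketch of a mixed soft-label/hard-label loss with a distillation temperature. This is an illustration in plain Python, not TextBrewer's actual implementation; the function names, the temperature `T=4.0`, and the weight `alpha=0.9` are assumptions chosen for the example.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, gold_label, T=4.0, alpha=0.9):
    """Mixed soft-label + hard-label loss (a sketch, not TextBrewer's code).

    Soft part: cross-entropy between the temperature-softened teacher and
    student distributions, scaled by T*T as in Hinton et al.'s formulation.
    Hard part: ordinary cross-entropy against the gold label.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    soft = -sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s)) * T * T
    hard = -math.log(softmax(student_logits)[gold_label])
    return alpha * soft + (1 - alpha) * hard

# A student that agrees with the teacher (and the gold label) is penalized less
# than one that disagrees.
loss = kd_loss([2.0, 0.5, -1.0], [3.0, 0.0, -2.0], gold_label=0)
```

Dynamic loss weight and temperature adjustment then amount to making `alpha` and `T` functions of the training step rather than constants.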
TextBrewer includes:
- Distillers: the cores of distillation. Different distillers perform different distillation modes. There are GeneralDistiller, MultiTeacherDistiller, BasicTrainer, etc.
- Configurations and presets: Configuration classes for training and distillation, and predefined distillation loss functions and strategies.
- Utilities: auxiliary tools such as model parameters analysis.
To start distillation, users need to provide
- the models (the trained teacher model and the untrained student model)
- datasets and experiment configurations
TextBrewer has achieved impressive results on several typical NLP tasks. See Experiments.
See Full Documentation for detailed usages.
Architecture

Installation
- Requirements
  - Python >= 3.6
  - PyTorch >= 1.1.0
  - TensorboardX or Tensorboard
  - NumPy
  - tqdm
  - Transformers >= 2.0 (optional, used by some examples)
  - Apex == 0.1.0 (optional, for mixed precision training)
- Install from PyPI

  ```bash
  pip install textbrewer
  ```

- Install from the GitHub source

  ```bash
  git clone https://github.com/airaria/TextBrewer.git
  pip install ./textbrewer
  ```
Workflow


- Stage 1: Preparation
  - Train the teacher model
  - Define and initialize the student model
  - Construct a dataloader, an optimizer, and a learning rate scheduler
- Stage 2: Distillation with TextBrewer
  - Construct a TrainingConfig and a DistillationConfig, and initialize a distiller
  - Define an adaptor and a callback. The adaptor adapts the model inputs and outputs for the distiller; the callback is called by the distiller during training
  - Call the train method of the distiller
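The Stage 2 pieces fit together roughly as follows. This is a schematic, dependency-free sketch of the control flow only: the names mirror TextBrewer's concepts (adaptor, callback, train), but every function here is a toy stand-in, not the library's API. The distiller iterates over the dataloader, runs the teacher and student, passes their raw outputs through the adaptors to get a uniform view, computes a loss, and invokes the callback.

```python
# Toy stand-ins for the workflow: NOT the textbrewer API, just the control flow.

def teacher_model(batch):
    # Pretend forward pass of the trained teacher.
    return {"logits": [x * 2.0 for x in batch]}

def student_model(batch):
    # Pretend forward pass of the (untrained) student.
    return {"logits": [x * 1.5 for x in batch]}

def adaptor(batch, model_outputs):
    # Maps raw model inputs/outputs to the fields the distiller understands.
    return {"logits": model_outputs["logits"]}

def callback(step, loss):
    # Called by the distiller during training, e.g. for logging or evaluation.
    print(f"step {step}: loss={loss:.4f}")

def train(dataloader, num_steps):
    """Schematic distillation loop: forward both models, adapt, compute loss, callback."""
    losses = []
    for step, batch in enumerate(dataloader):
        if step >= num_steps:
            break
        out_t = adaptor(batch, teacher_model(batch))
        out_s = adaptor(batch, student_model(batch))
        # Placeholder KD loss: MSE between teacher and student logits.
        loss = sum((t - s) ** 2 for t, s in zip(out_t["logits"], out_s["logits"]))
        # (A real distiller would backpropagate and step the optimizer here.)
        callback(step, loss)
        losses.append(loss)
    return losses

losses = train(dataloader=[[1.0, 2.0], [0.5, -0.5]], num_steps=2)
```

In TextBrewer itself, the dataloader, optimizer, and scheduler from Stage 1 are passed to the distiller's train method, and the configs decide which losses and matching schemes the loop applies.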
Quickstart
Here we show the usage of TextBrewer by distilling BERT-base to a 3-layer BERT.
Before distillation