
TextBrewer

A PyTorch-based knowledge distillation toolkit for natural language processing


English | 中文说明

<p align="center"> <br> <img src="./pics/banner.png" width="500"/> <br> </p> <p align="center"> <a href="https://github.com/airaria/TextBrewer/blob/master/LICENSE"> <img alt="GitHub" src="https://img.shields.io/github/license/airaria/TextBrewer.svg?color=blue&style=flat-square"> </a> <a href="https://textbrewer.readthedocs.io/"> <img alt="Documentation" src="https://img.shields.io/website?down_message=offline&label=Documentation&up_message=online&url=https%3A%2F%2Ftextbrewer.readthedocs.io"> </a> <a href="https://pypi.org/project/textbrewer"> <img alt="PyPI" src="https://img.shields.io/pypi/v/textbrewer"> </a> <a href="https://github.com/airaria/TextBrewer/releases"> <img alt="GitHub release" src="https://img.shields.io/github/v/release/airaria/TextBrewer?include_prereleases"> </a> </p>

TextBrewer is a PyTorch-based model distillation toolkit for natural language processing. It includes various distillation techniques from both the NLP and CV fields and provides an easy-to-use distillation framework that allows users to quickly experiment with state-of-the-art distillation methods, compressing models with a relatively small sacrifice in performance while increasing inference speed and reducing memory usage.

Check our paper through ACL Anthology or arXiv pre-print.

Full Documentation

News

Dec 17, 2021

  • We have released a model pruning toolkit TextPruner. Check https://github.com/airaria/TextPruner

Oct 24, 2021

  • We propose the first pre-trained language model that specifically focuses on Chinese minority languages. Check: https://github.com/ymcui/Chinese-Minority-PLM

Jul 8, 2021

  • New examples with Transformers 4
    • The current examples (examples/) were written with old versions of Transformers and may cause confusion and bugs. We have rewritten the examples with Transformers 4 in Jupyter Notebooks, which are easy to follow and learn from.
    • The new examples can be found at examples/notebook_examples. See Examples for details.

Mar 1, 2021

  • BERT-EMD and custom distiller

    • We added an experiment with BERT-EMD in the MNLI example. BERT-EMD allows each intermediate student layer to learn from any intermediate teacher layer adaptively, based on optimizing Earth Mover’s Distance, so there is no need to specify the matching scheme.
    • We have written a new EMDDistiller to perform BERT-EMD. It demonstrates how to write a custom distiller.
  • updated MNLI example

    • We removed pretrained_pytorch_bert and used the transformers library instead in all the MNLI examples.
<details> <summary>Click here to see old news</summary>

Nov 11, 2020

  • Updated to 0.2.1:

    • More flexible distillation: Supports feeding different batches to the student and teacher. It means the batches for the student and teacher no longer need to be the same. It can be used for distilling models with different vocabularies (e.g., from RoBERTa to BERT).

    • Faster distillation: Users now can pre-compute and cache the teacher outputs, then feed the cache to the distiller to save teacher's forward pass time.

      See Feed Different batches to Student and Teacher, Feed Cached Values for details of the above features.

    • MultiTaskDistiller now supports intermediate feature matching loss.

    • Tensorboard now records more detailed losses (KD loss, hard label loss, matching losses...).

    See details in releases.

August 27, 2020

We are happy to announce that our model is on top of the GLUE benchmark; check the leaderboard.

Aug 24, 2020

  • Updated to 0.2.0.1:
    • fixed bugs in MultiTaskDistiller and training loops.

Jul 29, 2020

  • Updated to 0.2.0:
    • Added support for distributed data-parallel training with DistributedDataParallel: TrainingConfig now accepts the local_rank argument. See the documentation of TrainingConfig for details.
  • Added an example of distillation on the Chinese NER task to demonstrate distributed data-parallel training. See examples/msra_ner_example.

Jul 14, 2020

  • Updated to 0.1.10:
    • Now supports mixed precision training with Apex! Just set fp16 to True in TrainingConfig. See the documentation of TrainingConfig for details.
    • Added data_parallel option in TrainingConfig to enable data parallel training and mixed precision training work together.

Apr 26, 2020

  • Added Chinese NER task (MSRA NER) results.
  • Added results for distilling to the T12-nano model, which has a structure similar to Electra-small.
  • Updated some results of CoNLL-2003, CMRC 2018 and DRCD.

Apr 22, 2020

  • Updated to 0.1.9 (added cache option which speeds up distillation; fixed some bugs). See details in releases.
  • Added experimental results for distilling Electra-base to Electra-small on Chinese tasks.
  • TextBrewer has been accepted by ACL 2020 as a demo paper, please use our new bib entry.

Mar 17, 2020

Mar 11, 2020

  • Updated to 0.1.8 (Improvements on TrainingConfig and train method). See details in releases.

Mar 2, 2020

  • Initial public version 0.1.7 has been released. See details in releases.
</details>

Table of Contents

<!-- TOC -->

| Section | Contents |
|-|-|
| Introduction | Introduction to TextBrewer |
| Installation | How to install |
| Workflow | Two stages of TextBrewer workflow |
| Quickstart | Example: distilling BERT-base to a 3-layer BERT |
| Experiments | Distillation experiments on typical English and Chinese datasets |
| Core Concepts | Brief explanations of the core concepts in TextBrewer |
| FAQ | Frequently asked questions |
| Known Issues | Known issues |
| Citation | Citation to TextBrewer |
| Follow Us | - |

<!-- /TOC -->

Introduction

TextBrewer is designed for the knowledge distillation of NLP models. It provides various distillation methods and offers a distillation framework for quickly setting up experiments.

The main features of TextBrewer are:

  • Wide-support: it supports various model architectures (especially transformer-based models)
  • Flexibility: design your own distillation scheme by combining different techniques; it also supports user-defined loss functions, modules, etc.
  • Easy-to-use: users don't need to modify the model architectures
  • Built for NLP: it is suitable for a wide variety of NLP tasks: text classification, machine reading comprehension, sequence labeling, ...

TextBrewer currently ships with the following distillation techniques:

  • Mixed soft-label and hard-label training
  • Dynamic loss weight adjustment and temperature adjustment
  • Various distillation loss functions: hidden states MSE, attention-matrix-based loss, neuron selectivity transfer, ...
  • Freely adding intermediate features matching losses
  • Multi-teacher distillation
  • ...
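To make the first technique concrete: mixed soft-label and hard-label training combines a temperature-scaled cross-entropy against the teacher's output distribution with an ordinary cross-entropy against the gold labels. The sketch below is a minimal, framework-free illustration of that loss (it is not TextBrewer's implementation; the function names and the weighting scheme are illustrative assumptions):

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T yields a softer distribution.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, T):
    # Soft-label loss: cross-entropy between the temperature-scaled
    # teacher and student distributions, scaled by T^2 as is conventional.
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return -T * T * sum(pt * math.log(ps) for pt, ps in zip(p_t, p_s))

def hard_label_loss(student_logits, gold):
    # Hard-label loss: ordinary cross-entropy against the gold class index.
    return -math.log(softmax(student_logits)[gold])

def mixed_loss(student_logits, teacher_logits, gold, T=4.0, hard_weight=0.5):
    # Weighted sum of the soft-label (KD) and hard-label terms.
    return (kd_loss(student_logits, teacher_logits, T)
            + hard_weight * hard_label_loss(student_logits, gold))
```

In TextBrewer, the analogous knobs are exposed through DistillationConfig (e.g. temperature and hard_label_weight) rather than written by hand.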

TextBrewer includes:

  1. Distillers: the core of the toolkit. Different distillers implement different distillation modes. There are GeneralDistiller, MultiTeacherDistiller, BasicTrainer, etc.
  2. Configurations and presets: Configuration classes for training and distillation, and predefined distillation loss functions and strategies.
  3. Utilities: auxiliary tools such as model parameters analysis.

To start distillation, users need to provide

  1. the models (the trained teacher model and the un-trained student model)
  2. datasets and experiment configurations

TextBrewer has achieved impressive results on several typical NLP tasks. See Experiments.

See Full Documentation for detailed usages.

Architecture

Installation

  • Requirements

    • Python >= 3.6
    • PyTorch >= 1.1.0
    • TensorboardX or Tensorboard
    • NumPy
    • tqdm
    • Transformers >= 2.0 (optional, used by some examples)
    • Apex == 0.1.0 (optional, mixed precision training)
  • Install from PyPI

    pip install textbrewer
    
  • Install from the Github source

    git clone https://github.com/airaria/TextBrewer.git
    pip install ./TextBrewer
    

Workflow

  • Stage 1: Preparation:

    1. Train the teacher model
    2. Define and initialize the student model
    3. Construct a dataloader, an optimizer, and a learning rate scheduler
  • Stage 2: Distillation with TextBrewer:

    1. Construct a TrainingConfig and a DistillationConfig, initialize a distiller
    2. Define an adaptor and a callback. The adaptor is used for adaptation of model inputs and outputs. The callback is called by the distiller during training
    3. Call the train method of the distiller
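The adaptor in step 2 is just a function that translates a model's raw inputs and outputs into a dictionary of fields the distiller consumes, and the callback is invoked with the student model at each checkpoint. A schematic, framework-free sketch of the two contracts (the recognized field names follow TextBrewer's documentation; the toy batch and outputs are invented for illustration):

```python
def simple_adaptor(batch, model_outputs):
    # Map the raw (batch, model_outputs) pair to the fields the distiller
    # understands. TextBrewer recognizes keys such as 'logits', 'hidden',
    # 'attention', 'inputs_mask' and 'losses'; only the ones you need
    # must be present.
    return {
        'logits': model_outputs['logits'],
        'hidden': model_outputs['hidden_states'],
    }

def checkpoint_callback(model, step):
    # Called by the distiller at each checkpoint, e.g. to run evaluation
    # on a dev set.
    print(f"checkpoint at step {step}")

# Toy stand-ins for a real batch and model output:
batch = {'input_ids': [[1, 2, 3]], 'labels': [0]}
model_outputs = {'logits': [[0.2, 0.8]], 'hidden_states': [[[0.1] * 4]]}
fields = simple_adaptor(batch, model_outputs)
```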

Quickstart

Here we show the usage of TextBrewer by distilling BERT-base to a 3-layer BERT.

Before distillation
