LightSeq: A High Performance Library for Sequence Processing and Generation

Release Notes
Introduction
- Support Matrix
Performance
Installation
- Install from PyPI
- Build from Source
Getting Started
Cite Us
We are Hiring!

Release Notes

[2022.10.25] Released v3.0.0, which supports int8 mixed-precision training and inference. [Chinese introduction]

[2021.06.18] Released v2.0.0, which supports fp16 mixed-precision training. [Chinese introduction]

[2019.12.06] Released v1.0.0, which supports fp16 mixed-precision inference. [Chinese introduction]

Introduction

LightSeq is a high-performance training and inference library for sequence processing and generation, implemented in CUDA. It enables highly efficient computation of modern NLP and CV models such as BERT, GPT, and Transformer. It is therefore most useful for machine translation, text generation, image classification, and other sequence-related tasks.

The library is built on top of the official CUDA libraries (cuBLAS, Thrust, CUB) and custom kernel functions that are specially fused and optimized for the Transformer model family. In addition to model components, the inference library also provides easy-to-deploy model management and a serving backend based on TensorRT Inference Server. With LightSeq, one can easily develop modified Transformer architectures with little additional code.

LightSeq training and inference are very fast. The overall performance is as follows:

- LightSeq fp16 training achieves a speedup of up to 3x compared to PyTorch fp16 training.
- LightSeq int8 training achieves a speedup of up to 5x compared to PyTorch QAT (quantization-aware training).
- LightSeq fp16 and int8 inference achieve speedups of up to 12x and 15x, respectively, compared to PyTorch fp16 inference.

Support Matrix

LightSeq supports multiple features, as shown in the table below.

| Features           | Support List                                                              |
| ------------------ | ------------------------------------------------------------------------- |
| Model              | Transformer, BERT, BART, GPT2, ViT, T5, MT5, XGLM, VAE, Multilingual, MoE |
| Layer              | embedding, encoder, decoder, criterion, optimizer                         |
| Precision          | fp32, fp16, int8                                                          |
| Mode               | training, inference                                                       |
| Compatibility      | Fairseq, Hugging Face, DeepSpeed                                          |
| Decoding Algorithm | beam search, diverse beam search, sampling, CRF                           |
| Others             | gradient communication quantization, auto-tune GEMM algorithm             |

The table below shows the running modes and precision currently supported by different models.

| Models       | fp16 Training | fp16 Inference | int8 Training | int8 Inference |
| ------------ | ------------- | -------------- | ------------- | -------------- |
| Transformer  | Yes           | Yes            | Yes           | Yes            |
| BERT         | Yes           | Yes            | Yes           | Yes            |
| GPT2         | Yes           | Yes            | Yes           | Yes            |
| BART         | Yes           | Yes            | -             | -              |
| T5           | -             | Yes            | -             | -              |
| MT5          | -             | Yes            | -             | -              |
| XGLM         | -             | Yes            | -             | -              |
| ViT          | Yes           | Yes            | Yes           | Yes            |
| VAE          | -             | Yes            | -             | -              |
| Multilingual | -             | Yes            | -             | Yes            |
| MoE          | -             | Yes            | -             | -              |
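
When picking a model and precision programmatically, the per-model table can be transcribed into a small lookup. The sketch below is only an illustration: `SUPPORT`, `COLUMNS`, and `is_supported` are hypothetical names and are not part of the LightSeq API. The values are copied from the table, with `-` stored as `False`.

```python
# Per-model support table transcribed from this README ("-" becomes False).
# Tuple order: (fp16 training, fp16 inference, int8 training, int8 inference)
SUPPORT = {
    "Transformer": (True, True, True, True),
    "BERT": (True, True, True, True),
    "GPT2": (True, True, True, True),
    "BART": (True, True, False, False),
    "T5": (False, True, False, False),
    "MT5": (False, True, False, False),
    "XGLM": (False, True, False, False),
    "ViT": (True, True, True, True),
    "VAE": (False, True, False, False),
    "Multilingual": (False, True, False, True),
    "MoE": (False, True, False, False),
}

# Maps a (precision, mode) pair to its position in the tuples above.
COLUMNS = {
    ("fp16", "training"): 0,
    ("fp16", "inference"): 1,
    ("int8", "training"): 2,
    ("int8", "inference"): 3,
}


def is_supported(model, precision, mode):
    """Hypothetical helper: return one cell of the support table."""
    return SUPPORT[model][COLUMNS[(precision, mode)]]


print(is_supported("BART", "int8", "training"))
print(is_supported("Multilingual", "int8", "inference"))
```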

Performance

We measure the speedup of LightSeq training and inference using both fp16 and int8 mixed precision on Transformer and BERT models. The baseline is PyTorch fp16 mixed precision. Training experiments are run on one A100 GPU and inference experiments are run on eight A100 GPUs.

More performance results are available here.

Speedup of Transformer Training

| Batch Token Size | PyTorch QAT | LightSeq fp16 | LightSeq int8 |
| ---------------- | ----------- | ------------- | ------------- |
| 512              | 0.36        | 1.99          | 1.86          |
| 1024             | 0.37        | 1.78          | 1.69          |
| 2048             | 0.37        | 1.56          | 1.50          |
| 4096             | 0.39        | 1.47          | 1.44          |
| 8192             | 0.41        | 1.44          | 1.44          |
| 15000            | 0.43        | 1.44          | 1.44          |
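
Every column in this table is a speedup over the same PyTorch fp16 baseline, so two columns can be compared by dividing them. The sketch below assumes the values are ratios against that common baseline (as the Performance section states) and derives the LightSeq int8 speedup over PyTorch QAT. The helper name `int8_vs_qat` is hypothetical.

```python
# Speedups over PyTorch fp16, copied from the Transformer training table.
# batch token size -> (PyTorch QAT, LightSeq fp16, LightSeq int8)
rows = {
    512: (0.36, 1.99, 1.86),
    1024: (0.37, 1.78, 1.69),
    2048: (0.37, 1.56, 1.50),
    4096: (0.39, 1.47, 1.44),
    8192: (0.41, 1.44, 1.44),
    15000: (0.43, 1.44, 1.44),
}


def int8_vs_qat(table):
    """Divide the LightSeq int8 column by the PyTorch QAT column, row by row."""
    return {size: int8 / qat for size, (qat, _fp16, int8) in table.items()}


ratios = int8_vs_qat(rows)
for size, ratio in ratios.items():
    print(f"{size:>6} tokens: {ratio:.2f}x faster than PyTorch QAT")

# Batch token size with the largest int8-over-QAT ratio.
best = max(ratios, key=ratios.get)
print("largest ratio at batch token size", best)
```

The largest ratio falls at a batch token size of 512 (1.86 / 0.36, about 5.2), which is consistent with the "up to 5x" figure quoted in the Introduction.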

Speedup of BERT Training

| Batch Token Size | PyTorch QAT | LightSeq fp16 | LightSeq int8 |
| ---------------- | ----------- | ------------- | ------------- |
| 8                | 0.45        | 2.12          | 1.99          |
| 16               | 0.44        | 1.92          | 1.80          |
| 32               | 0.42        | 1.59          | 1.52          |
| 64               | 0.46        | 1.62          | 1.58          |
| 128              | 0.46        | 1.74          | 1.70          |
| 256              | 0.46        | 1.68          | 1.73          |

Speedup of Transformer Inference

| Batch Size | Sequence Length | LightSeq fp16 | LightSeq int8 |
| ---------- | --------------- | ------------- | ------------- |
| 1          | 8               | 8.00          | 9.33          |
| 1          | 32              | 6.48          | 7.38          |
| 1          | 128             | 6.24          | 6.19          |
| 8          | 8               | 9.38          | 10.71         |
| 8          | 32              | 8.24          | 8.75          |
| 8          | 128             | 6.83          | 7.28          |
| 32         | 8               | 11.82         | 14.44         |
| 32         | 32              | 9.68          | 11.15         |
| 32         | 128             | 6.68          | 7.74          |

Speedup of BERT Inference

| Batch Size | Sequence Length | LightSeq fp16 | LightSeq int8 |
| ---------- | --------------- | ------------- | ------------- |
| 1          | 8               | 9.22          | 9.87          |
| 1          | 32              | 10.51         | 11.30         |
| 1          | 128             | 9.96          | 10.85         |
| 8          | 8               | 9.88          | 10.33         |
| 8          | 32              | 7.79          | 8.22          |
| 8          | 128             | 4.04          | 4.35          |
| 32         | 8               | 10.60         | 11.02         |
| 32         | 32              | 8.11          | 8.85          |
| 32         | 128             | 1.82          | 2.04          |

Installation

Install from PyPI

You can install LightSeq from PyPI. The PyPI package supports only Python 3.6 to 3.8 on Linux:

pip install lightseq
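
Before running the command, it can help to confirm that the interpreter falls inside the supported range. The following is a minimal sketch using only the standard library. `pypi_package_supported` is a hypothetical helper, not something LightSeq ships, and it encodes only the constraint stated above (Python 3.6 to 3.8, Linux).

```python
import sys


def pypi_package_supported(version_info=None, platform_name=None):
    """Hypothetical check: True if the PyPI package's stated constraints hold.

    Defaults to the running interpreter; arguments allow testing other setups.
    """
    version_info = sys.version_info if version_info is None else version_info
    platform_name = sys.platform if platform_name is None else platform_name
    major_minor = tuple(version_info[:2])
    on_linux = platform_name.startswith("linux")
    return on_linux and (3, 6) <= major_minor <= (3, 8)


if pypi_package_supported():
    print("This environment matches the PyPI package constraints.")
else:
    print("Outside Python 3.6-3.8 on Linux; consider building from source.")
```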

Build from Source

You can also build from source:

PATH=/usr/local/hdf5/:$PATH ENABLE_FP32=0 ENABLE_DEBUG=0 pip install -e $PROJECT_DIR

Detailed build instructions are available here.

Getting Started

We provide several samples here to show the usage of LightSeq. Refer to the complete user guide and examples for more details.

LightSeq Training from Scratch

You can use the modules provided by LightSeq to build your own models. The following is an example of building a Transformer encoder layer.

First, import the LightSeq Transformer encoder module:

from lightseq.training import LSTransformerEncoderLayer

Then create an encoder configuration, and create a LightSeq Transformer encoder layer initialized with the configuration:

config = LSTransformerEncoderLayer.get_config(
    max_batch_tokens=4096,
    max_seq_len=512,
    hidden_size=1024,
    intermediate_size=4096,
    nhead=16,
    attn_prob_dropout_ratio=0.1,
    activation_dropout_ratio=0.1,
    hidden_dropout_ratio=0.1,
    pre_layer_norm=True,
    activation_fn="relu",
    fp16=True,
    local_rank=0,
)
layer = LSTransformerEncoderLayer(config)
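
The sizes in this configuration constrain each other. The sketch below is plain Python and makes no LightSeq calls. It rests on two assumptions that are not spelled out in this README: that `hidden_size` is split evenly across `nhead` attention heads, as in standard multi-head attention, and that `max_batch_tokens` caps `batch_size * seq_len` for a padded batch. The helpers `head_dim` and `batch_fits` are hypothetical.

```python
# Values copied from the configuration above.
max_batch_tokens = 4096
max_seq_len = 512
hidden_size = 1024
nhead = 16


def head_dim(hidden_size, nhead):
    """Per-head width, assuming hidden_size is split evenly across heads."""
    if hidden_size % nhead != 0:
        raise ValueError("hidden_size must be divisible by nhead")
    return hidden_size // nhead


def batch_fits(batch_size, seq_len, max_batch_tokens, max_seq_len):
    """Assumed limit: padded batch must satisfy both token and length caps."""
    within_length = seq_len <= max_seq_len
    within_tokens = batch_size * seq_len <= max_batch_tokens
    return within_length and within_tokens


print("head dimension:", head_dim(hidden_size, nhead))
print("8 x 512 fits:", batch_fits(8, 512, max_batch_tokens, max_seq_len))
print("16 x 512 fits:", batch_fits(16, 512, max_batch_tokens, max_seq_len))
```

Under these assumptions each head is 64 wide, and a batch of full-length sequences holds at most 8 of them (8 x 512 = 4096 tokens).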

Besides encoder layers, the other modules can be created in the same way and then trained like ordinary PyTorch models.

More usage examples are available here.
