# PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
Based on llama.cpp, inference of LLaMA model in pure C/C++.
<details>
<summary>Table of Contents</summary>
<ol>
  <li><a href="#description">Description</a></li>
  <li>
    <a href="#usage">Usage</a>
    <ul>
      <li><a href="#get-the-code">Get the Code</a></li>
      <li><a href="#build">Build</a></li>
      <li><a href="#run">Run</a></li>
    </ul>
  </li>
  <li><a href="#tuning-pipeinfer">Tuning PipeInfer</a></li>
</ol>
</details>

## Description
PipeInfer is a modification of llama.cpp designed to accelerate inference across multi-node clusters using pipelined asynchronous speculation.
- Supports Llama/Llama2, Falcon, Baichuan, Starcoder, Bloom, and other architectures.
- Based on llama.cpp commit 9656026
- See progress on porting to the latest version of llama.cpp here
Supported platforms:
- [X] Mac OS
- [X] Linux
- [X] Docker
Supported models:
- [X] LLaMA 🦙
- [X] LLaMA 2 🦙🦙
- [X] Falcon
- [X] Alpaca
- [X] GPT4All
- [X] Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- [X] Vigogne (French)
- [X] Vicuna
- [X] Koala
- [X] OpenBuddy 🐶 (Multilingual)
- [X] Pygmalion/Metharme
- [X] WizardLM
- [X] Baichuan 1 & 2 + derivations
- [X] Aquila 1 & 2
- [X] Starcoder models
- [X] Mistral AI v0.1
- [X] Refact
- [X] Persimmon 8B
- [X] MPT
- [X] Bloom
- [X] StableLM-3b-4e1t
## Usage
PipeInfer is implemented in the `speculative` example binary. To run PipeInfer, follow these steps.
### Get the Code
```sh
git clone https://github.com/AutonomicPerfectionist/PipeInfer
cd PipeInfer
```
### Install Dependencies
Make sure you have Make or CMake installed, as well as an MPI implementation and its compiler wrappers. On Debian-based systems, these can be installed with the following:
```sh
$ sudo apt install build-essential cmake openmpi-bin libopenmpi-dev
```
### Build
To build PipeInfer, you have two options:

- Using `make` (on Linux or macOS):

  ```sh
  make speculative CC=mpicc CXX=mpicxx LLAMA_MPI=ON -j
  ```

- Using `CMake`:

  ```sh
  mkdir build
  cd build
  cmake .. -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DLLAMA_MPI=1
  cmake --build . --target speculative --config Release
  ```
### Download Models
PipeInfer requires two models to be downloaded: the large target model and a smaller speculative model. Both models must have compatible vocabularies; minor differences are allowed but may cause performance degradation. The models used in our paper can be downloaded from the following links:
- Dolphin 70B
- TinyLlama OpenOrca 1.1B
- Orca2
- Goliath 120B
- XWinLM 7B
- XWinLM 13B
- Falcon 180B
- Falcon 7B
- Falcon 40B
### Run
To run PipeInfer, make sure you have MPI configured on all nodes, and that the head node can SSH into all other nodes.
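Before the first run, it can help to verify connectivity with a trivial MPI job (the hostfile path here is an assumption; substitute your own):

```sh
# Each node should print its hostname; missing output points to an SSH or MPI configuration issue
$ mpirun -hostfile /mnt/cluster/hostsfile hostname
```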
It is recommended to place the model files on shared storage to avoid replicating them to each node. We have had success
using a dedicated NFS file server. By default llama.cpp loads the models by mmap'ing them, and only
the tensors needed by the layers assigned to a node will be faulted in. For multi-socket systems, it is recommended to clear the Linux file cache
whenever the layer allocations are changed, as otherwise the tensors may be kept on the wrong NUMA node.
This can be done with the following command:
```sh
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
```
Alternatively, using the `--no-mmap` switch when running PipeInfer/llama.cpp will read the file contents into RAM directly, bypassing the file cache.
Note that doing so forces the entire model to be loaded into RAM, potentially exceeding the memory capacity of the node.
To execute PipeInfer on a cluster, ensure that each node has access to the `speculative` binary and the model files. Like any other MPI program,
it can be launched with `mpirun` or through job managers like Slurm. PipeInfer also requires the layer allocations to be passed on the command line
through the `--mpi-layer-split` argument. For each node in the cluster, one must pass a floating-point value denoting the fraction
of layers that node will handle. When using multiple models, as in the case of PipeInfer, one must also specify how to split the
nodes into two communicators. This split is denoted by a slash (`/`): all fractions before the slash are tied to the target model, and fractions after
the slash are tied to the speculative model. For example, if there are 5 nodes total and we want 4 nodes to handle the target model and one node to handle
the speculative model, with even splits among the target nodes, we would pass the following:
```
--mpi-layer-split 0.25,0.25,0.25,0.25/1.0
```
If we wanted to dedicate two nodes to the speculative model instead and only 3 to the target model, we would move the slash so that there are three values on the left half and two values on the right. The values themselves must also be adjusted, as shown below. PipeInfer will automatically allocate any leftover layers to the head node (node 0), but it is best to allocate them explicitly.
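For instance, a hypothetical 3-target/2-speculative split (the exact fractions are placeholders to be tuned for your hardware) could look like:

```
--mpi-layer-split 0.4,0.3,0.3/0.5,0.5
```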
An example command to launch PipeInfer across a cluster of 4 nodes is shown below:
```sh
mpirun -hostfile /mnt/cluster/hostsfile --bind-to none \
    /var/llama.cpp/bin/speculative \
    -md /mnt/cluster/models/tinyllama-1.1b-1t-openorca.Q4_K_M.gguf \
    -m /mnt/cluster/models/dolphin-2.1-70b.Q3_K_M.gguf \
    -e \
    -f /mnt/cluster/llama.cpp/prompts/dolphin.txt \
    -n 128 \
    --mpi-layer-split 0.1,0.4,0.5/1.0 \
    --ignore-eos \
    --temp -1.0 \
    --repeat-last-n 0 \
    --draft 4 \
    -tb 12,32,40,32 \
    -t 6,32,20,32 \
    -c 1024 \
    -pa 0.001 \
    -ps 0.8 \
    --numa \
    -pr 0.4 \
    -pd 0.01 \
    -np 3
```
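The hostfile passed to `mpirun` above is not included in this repository; a minimal sketch in OpenMPI's hostfile format (hostnames and slot counts are assumptions) would be:

```
# /mnt/cluster/hostsfile: one line per node, one MPI rank each
node0 slots=1
node1 slots=1
node2 slots=1
node3 slots=1
```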
The confidence cutoff recovery and decay factors are the `-pr` and `-pd` parameters, respectively. The `-pa` parameter is the base acceptance confidence cutoff, and the `-ps` parameter
is the split confidence threshold. The `-np` parameter determines the maximum number of sequences within a sequence partition. The `--draft` parameter determines the maximum number of tokens
in a speculative microbatch. The `-t` and `-tb` parameters set the number of threads to use for single-token and batched inference on each node. Finally, we pass the `--bind-to none` parameter
to `mpirun` to enable each rank to use more than one thread. All other parameters are unmodified from upstream llama.cpp; use the `--help` parameter to list all available
parameters and what they do.
## Tuning PipeInfer
PipeInfer offers many tunable parameters that can drastically affect performance. The most important of these are the `-pa`, `-ps`, `-np`, and `--draft` parameters. These four parameters determine the
average size of a speculative microbatch. An important fact to consider is that the acceptance confidence cutoff (`-pa`) corresponds to how confident the speculative model is in its own
output, not how confident it is that it matches the target's output.
Some models, such as Orca 2, have high confidence in their own outputs but low alignment with their target. In such cases,
it is best to set the acceptance confidence cutoff high, such as 0.5, and the draft length (`--draft`) low, such as 3.
Conversely, some models exhibit high alignment with their target. In these cases, increasing the draft length slightly can produce
better performance, but it is almost always best to keep the draft length below 8 to maintain low latency and high granularity with regard to cancellation.
Instead, the acceptance confidence cutoff can be lowered significantly, down to 0.1 in our testing with Goliath and XWin 7B. Such a low cutoff
enables PipeInfer to generate as many speculative runs as possible. Keeping the draft length low as well (3 for Goliath and XWin 7B) enables the system to
generate many small microbatches that can be evaluated with low latency while maintaining a high enough acceptance rate that the pipeline is not thrashed
with flushing operations.
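To summarize the two regimes as concrete flag combinations (hypothetical starting points based on the values above, not definitive settings):

```sh
# High self-confidence, low alignment (e.g. Orca 2): high cutoff, short drafts
-pa 0.5 --draft 3

# High alignment (e.g. XWin 7B drafting for Goliath): low cutoff, short drafts
-pa 0.1 --draft 3
```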
The split co
