# PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation
Based on llama.cpp, inference of LLaMA model in pure C/C++.
<details>
<summary>Table of Contents</summary>
<ol>
  <li><a href="#description">Description</a></li>
  <li>
    <a href="#usage">Usage</a>
    <ul>
      <li><a href="#get-the-code">Get the Code</a></li>
      <li><a href="#build">Build</a></li>
      <li><a href="#run">Run</a></li>
    </ul>
  </li>
  <li><a href="#tuning-pipeinfer">Tuning PipeInfer</a></li>
</ol>
</details>

## Description
PipeInfer is a modification of llama.cpp designed to accelerate inference across multi-node clusters using pipelined asynchronous speculation.
- Supports Llama/Llama2, Falcon, Baichuan, Starcoder, Bloom, and other architectures.
- Based on llama.cpp commit 9656026
- See progress on porting to the latest version of llama.cpp here
Supported platforms:
- [X] Mac OS
- [X] Linux
- [X] Docker
Supported models:
- [X] LLaMA 🦙
- [X] LLaMA 2 🦙🦙
- [X] Falcon
- [X] Alpaca
- [X] GPT4All
- [X] Chinese LLaMA / Alpaca and Chinese LLaMA-2 / Alpaca-2
- [X] Vigogne (French)
- [X] Vicuna
- [X] Koala
- [X] OpenBuddy 🐶 (Multilingual)
- [X] Pygmalion/Metharme
- [X] WizardLM
- [X] Baichuan 1 & 2 + derivations
- [X] Aquila 1 & 2
- [X] Starcoder models
- [X] Mistral AI v0.1
- [X] Refact
- [X] Persimmon 8B
- [X] MPT
- [X] Bloom
- [X] StableLM-3b-4e1t
## Usage
PipeInfer is implemented in the `speculative` example binary. To run PipeInfer, follow these steps.
### Get the Code
```sh
git clone https://github.com/AutonomicPerfectionist/PipeInfer
cd PipeInfer
```
### Install Dependencies
Make sure you have Make or CMake installed, as well as an MPI implementation and its compiler wrappers. On Debian-based systems, these can be installed with the following:
```sh
$ sudo apt install build-essential cmake openmpi-bin libopenmpi-dev
```
### Build
To build PipeInfer, you have two options:

- Using `make` (on Linux or macOS):

  ```sh
  make speculative CC=mpicc CXX=mpicxx LLAMA_MPI=ON -j
  ```

- Using `CMake`:

  ```sh
  mkdir build
  cd build
  cmake .. -DCMAKE_C_COMPILER=mpicc -DCMAKE_CXX_COMPILER=mpicxx -DLLAMA_MPI=1
  cmake --build . --target speculative --config Release
  ```
### Download Models
PipeInfer requires two models to be downloaded: the large target model and a smaller speculative model. Both models must have compatible vocabularies; minor differences are allowed but may cause performance degradation. The models used in our paper can be downloaded from the following links:
- Dolphin 70B
- TinyLlama OpenOrca 1.1B
- Orca2
- Goliath 120B
- XWinLM 7B
- XWinLM 13B
- Falcon 180B
- Falcon 7B
- Falcon 40B
### Run
To run PipeInfer, make sure you have MPI configured on all nodes, and that the head node can SSH into all other nodes.
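Before the first run, it can help to verify connectivity with a trivial MPI job (the hostfile path here is an assumption; substitute your own):

```sh
# Each node should print its hostname; missing output points to an SSH or MPI configuration issue
$ mpirun -hostfile /mnt/cluster/hostsfile hostname
```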
It is recommended to place the model files on shared storage to avoid replicating them to each node. We have had success
using a dedicated NFS file server. By default llama.cpp loads the models by mmap'ing them, and only
the tensors needed by the layers assigned to a node will be faulted in. For multi-socket systems, it is recommended to clear the Linux file cache
whenever the layer allocations are changed, as otherwise the tensors may be kept on the wrong NUMA node.
This can be done with the following command:
```sh
$ echo 3 | sudo tee /proc/sys/vm/drop_caches
```
Alternatively, using the `--no-mmap` switch when running PipeInfer/llama.cpp will read the file contents into RAM directly, bypassing the file cache.
Note that doing so forces the entire model to be loaded into RAM, potentially exceeding the memory capacity of the node.
To execute PipeInfer on a cluster, ensure that each node has access to the `speculative` binary and the model files. Like any other MPI program,
it can be launched with `mpirun` or through job managers like Slurm. PipeInfer also requires the layer allocations to be passed on the command line
through the `--mpi-layer-split` argument. For each node in the cluster, one must pass a floating-point value denoting the fraction
of layers that node will handle. When using multiple models, as in the case of PipeInfer, one must also specify how to split the
nodes into two communicators. This split is denoted by a slash (`/`): all fractions before the slash are tied to the target model, and fractions after
the slash are tied to the speculative model. For example, if there are 5 nodes total and we want 4 nodes to handle the target model and one node to handle
the speculative model, with even splits among the target nodes, we would pass the following:
```
--mpi-layer-split 0.25,0.25,0.25,0.25/1.0
```
If we wanted to dedicate two nodes to the speculative model instead and only 3 to the target model, we would move the slash so that there are three values on the left half and two values on the right. The values themselves must also be adjusted, as shown below. PipeInfer will automatically allocate any leftover layers to the head node (node 0), but it is best to allocate them explicitly.
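For instance, a hypothetical 3-target/2-speculative split (the exact fractions are placeholders to be tuned for your hardware) could look like:

```
--mpi-layer-split 0.4,0.3,0.3/0.5,0.5
```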
An example command to launch PipeInfer across a cluster of 4 nodes is shown below:
```sh
mpirun -hostfile /mnt/cluster/hostsfile --bind-to none \
    /var/llama.cpp/bin/speculative \
    -md /mnt/cluster/models/tinyllama-1.1b-1t-openorca.Q4_K_M.gguf \
    -m /mnt/cluster/models/dolphin-2.1-70b.Q3_K_M.gguf \
    -e \
    -f /mnt/cluster/llama.cpp/prompts/dolphin.txt \
    -n 128 \
    --mpi-layer-split 0.1,0.4,0.5/1.0 \
    --ignore-eos \
    --temp -1.0 \
    --repeat-last-n 0 \
    --draft 4 \
    -tb 12,32,40,32 \
    -t 6,32,20,32 \
    -c 1024 \
    -pa 0.001 \
    -ps 0.8 \
    --numa \
    -pr 0.4 \
    -pd 0.01 \
    -np 3
```
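The hostfile passed to `mpirun` above is not included in this repository; a minimal sketch in OpenMPI's hostfile format (hostnames and slot counts are assumptions) would be:

```
# /mnt/cluster/hostsfile: one line per node, one MPI rank each
node0 slots=1
node1 slots=1
node2 slots=1
node3 slots=1
```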
The confidence cutoff recovery and decay factors are the `-pr` and `-pd` parameters, respectively. The `-pa` parameter is the base acceptance confidence cutoff, and the `-ps` parameter
is the split confidence threshold. The `-np` parameter determines the maximum number of sequences within a sequence partition. The `--draft` parameter determines the maximum number of tokens
in a speculative microbatch. The `-t` and `-tb` parameters set the number of threads to use for single-token and batched inference on each node. Finally, we pass the `--bind-to none` parameter
to `mpirun` to enable each rank to use more than one thread. All other parameters are unmodified from upstream llama.cpp; use the `--help` parameter to list all available
parameters and what they do.
## Tuning PipeInfer
PipeInfer offers many tunable parameters that can drastically affect performance. The most important of these are the `-pa`, `-ps`, `-np`, and `--draft` parameters. These four parameters determine the
average size of a speculative microbatch. An important fact to consider is that the acceptance confidence cutoff (`-pa`) corresponds to how confident the speculative model is in its own
output, not how confident it is that it matches the target's output.
Some models, such as Orca 2, have high confidence in their own outputs but low alignment with their target. In such cases,
it is best to set the acceptance confidence cutoff high, such as 0.5, and the draft length (`--draft`) low, such as 3.
Conversely, some models exhibit high alignment with their target. In these cases, increasing the draft length slightly can produce
better performance, but it is almost always best to keep the draft length below 8 to maintain low latency and high granularity with regard to cancellation.
Instead, the acceptance confidence cutoff can be lowered significantly, down to 0.1 in our testing with Goliath and XWin 7B. Such a low cutoff
enables PipeInfer to generate as many speculative runs as possible. Keeping the draft length low as well (3 for Goliath and XWin 7B) enables the system to
generate many small microbatches that can be evaluated with low latency while maintaining a high enough acceptance rate that the pipeline is not thrashed
with flushing operations.
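To summarize the two regimes as concrete flag combinations (hypothetical starting points based on the values above, not definitive settings):

```sh
# High self-confidence, low alignment (e.g. Orca 2): high cutoff, short drafts
-pa 0.5 --draft 3

# High alignment (e.g. XWin 7B drafting for Goliath): low cutoff, short drafts
-pa 0.1 --draft 3
```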
The split co
