SpecEE

Repo for SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting (ISCA25)

Paper website: https://arxiv.org/pdf/2504.08850

Acknowledgement

The SpecEE is implemented based on the HuggingFace framework in the cloud scenario and the llama.cpp scenario in the edge scenario. We modify the part of code to support the SpecEE, and we will conduct the faster research iteration in the future.

Demo

Left: SpecEE & Right: llama.cpp

SpecEE for Edge Scenario

please enter in the SpecEE_PC directory

Software Dependencies

torch~=2.2.1 numpy~=1.26.4 sentencepiece~=0.2.0 transformers>=4.45.1,<5.0.0 gguf>=0.1.0 protobuf>=4.21.0,<5.0.0

Environment Set up

cd SpecEE_PC
conda create -n specee python=3.12
conda activate specee
pip install -r  ./requirements/requirements-convert_hf_to_gguf.txt

Download model weights from Huggingface (Recommended)

huggingface-cli login
huggingface-cli download YYDH2333/SpecEE-7b-chat-hf

Build model weights from scratch

Download Llama-7b-chat-hf and EAGLE-llama2-chat-7B

huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-chat-hf
huggingface-cli download yuhuili/EAGLE-llama2-chat-7B

Copy Llama-7b-chat-hf weights to ./specee_weights

cp Llama-7b-chat-hf/config.json specee_weights/ specee_weights/
cp Llama-7b-chat-hf/generation_config.json specee_weights/
cp Llama-7b-chat-hf/model-00001-of-00002.safetensors specee_weights/
cp Llama-7b-chat-hf/model-00002-of-00002.safetensors specee_weights/
cp Llama-7b-chat-hf/special_tokens_map.json specee_weights/
cp Llama-7b-chat-hf/tokenizer.json specee_weights/
cp Llama-7b-chat-hf/tokenizer.model specee_weights/
cp Llama-7b-chat-hf/tokenizer_config.json specee_weights/

Run ./specee_weights/add_eagle.py

python ./specee_weights/add_eagle.py

Convert safetensors to gguf

python convert_hf_to_gguf.py Llama-2-7b-chat-hf --outtype f16 --outfile models/Llama-2-7b-chat-hf.gguf --verbose 
python convert_hf_to_gguf.py specee_weights --outtype f16 --outfile models/SpecEE-7b-chat-hf.gguf --verbose

Compilation

make GGML_CUDA=1 -j$(nproc)

Running Example

CUDA_VISIBLE_DEVICES=0 \
./llama-cli \
-m models/SpecEE-7b-chat-hf.gguf \
-p "Compose an engaging travel blog post about a recent trip to Hawaii, highlighting cultural experiences and must-see attractions." \
-e -ngl 16 -t 4 -n 256 -c 512 -s 8 --top_k 0 --temp 0

Eval on Alpaca dataset

python eval_on_alpaca_dataset.py

Eval on GSM8k dataset

python eval_on_gsm8k_dataset.py

Eval on HumanEval dataset

python eval_on_humaneval_dataset.py

Eval on MT-Bench dataset

python eval_on_mt-bench_dataset.py

Eval on commonsenseQA dataset

python eval_on_qa_dataset.py

Eval on SUM dataset

python eval_on_sum_dataset.py

SpecEE for Cloud Scenario

please enter in the SpecEE_cloud directory

Setup

First setup the environments.

We need to setup three environments to evaluate SpecEE in huggingface and awq.

Huggingface + SpecEE

cd SpecEE-cloud
conda create -n SpecEE python==3.10
conda activate SpecEE
pip install -r requirements.txt

Raw awq

cd SpecEE-cloud
conda create -n awq python==3.10
conda activate awq
pip install -r requirements_awq.txt

Awq + SpecEE

cd SpecEE-cloud
conda create -n specee_awq python==3.10
conda activate specee_awq
pip install -r requirements_awq.txt
cd AutoAWQ-0.2.6
pip install -e .

Second download the models needed.

huggingface-cli login
huggingface-cli download meta-llama/Llama-2-7b-chat-hf
huggingface-cli download TheBloke/Llama-2-7B-Chat-AWQ
huggingface-cli download yuhuili/EAGLE-llama2-chat-7B

Evaluation

Huggingface + SpecEE

In our code, we can evaluate the speed and accuracy performance based on Huggingface.

Firstly, activate SpecEE environment.

conda activate SpecEE

The example command to evaluate speed.

CUDA_VISIBLE_DEVICES=0 python EEInference.py --base-model-path meta-llama/Llama-2-7b-chat-hf --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset mt_bench --task speed --predictor-path [the local path of ./llama-7b]

The example command to evaluate accuracy.

CUDA_VISIBLE_DEVICES=0 python EEInference.py --base-model-path meta-llama/Llama-2-7b-chat-hf --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset sst2 --task accuracy --predictor-path [the local path of ./llama-7b]

You can replace the model path information in run_hf_specee.sh to complete all AE experiments replication in one go.

./run_hf_specee.sh

Awq + SpecEE

In our AE code, we can evaluate the speed and accuracy performance based on awq.

Accuracy

Firstly, activate SpecEE environment.

conda activate SpecEE

The example command.

CUDA_VISIBLE_DEVICES=0 python EEInference_awq.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ --draft-model-path yuhuili/EAGLE-llama2-chat-7B --dataset commonsenseqa --task accuracy --predictor-path [the local path of ./llama-7b]

Speed

Firstly, activate awq environment.

conda activate awq

You can get all awq speed via:

CUDA_VISIBLE_DEVICES=0 python AwqInference.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ

Next, activate specee_awq environment.

conda activate specee_awq

You can get all awq+specee speed via:

CUDA_VISIBLE_DEVICES=0 python AwqEEInference.py --base-model-path TheBloke/Llama-2-7B-Chat-AWQ --draft-model-path yuhuili/EAGLE-llama2-chat-7B

The speed evaluation results of awq and awq+specee are in raw_awq.json and specee_awq.json.

You can run calculate_awq_speedup.py to get speedup ratio of awq+specee.

> python calculate_awq_speedup.py

SpecEE

Install / Use

README

SpecEE

Acknowledgement

Demo

SpecEE for Edge Scenario

Software Dependencies

Environment Set up

Download model weights from Huggingface (Recommended)

Build model weights from scratch

Compilation

Running Example

Eval on Alpaca dataset

Eval on GSM8k dataset

Eval on HumanEval dataset

Eval on MT-Bench dataset

Eval on commonsenseQA dataset

Eval on SUM dataset

SpecEE for Cloud Scenario

Setup

Huggingface + SpecEE

Raw awq

Awq + SpecEE

Evaluation

Huggingface + SpecEE

Awq + SpecEE

Accuracy

Speed

Related Skills