# ViSpec

[NeurIPS 2025] Official implementation of *ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding*.
## ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding
Jialiang Kang, Han Shu, Wenshuo Li, Yingjie Zhai, Xinghao Chen
<a href="http://arxiv.org/abs/2509.15235"><img src="https://img.shields.io/static/v1?label=arXiv&message=Paper&color=red&logo=arxiv"></a> <a href="https://huggingface.co/collections/JLKang/vispec-68cd460c5766dc65e908909f"><img src="https://img.shields.io/static/v1?label=HuggingFace&message=Collection&color=yellow&logo=huggingface"></a>
<p align="center"> <img src="./figs/speedup_t0.png" alt="benchmark"> </p>

## Overview
Speculative decoding is a widely adopted technique for accelerating inference in large language models (LLMs), yet its application to vision-language models (VLMs) remains underexplored, with existing methods achieving only modest speedups ($<1.5\times$). This gap is increasingly significant as multimodal capabilities become central to large-scale models.

We hypothesize that large VLMs can effectively filter redundant image information layer by layer without compromising textual comprehension, whereas smaller draft models struggle to do so. To address this, we introduce Vision-Aware Speculative Decoding (ViSpec), a novel framework tailored for VLMs. ViSpec employs a lightweight vision adaptor module to compress image tokens into a compact representation, which is seamlessly integrated into the draft model's attention mechanism while preserving original image positional information. Additionally, we extract a global feature vector for each input image and augment all subsequent text tokens with this feature to enhance multimodal coherence.

To overcome the scarcity of multimodal datasets with long assistant responses, we curate a specialized training dataset by repurposing existing datasets and generating extended outputs using the target VLM with modified prompts. Our training strategy mitigates the risk of the draft model exploiting direct access to the target model's hidden states, which could otherwise lead to shortcut learning when training solely on target model outputs. Extensive experiments validate ViSpec, achieving, to our knowledge, the first substantial speedup in VLM speculative decoding.
## Requirements

The code requires `python>=3.10` and `transformers==4.51.3`. You can install the dependencies using pip:

```bash
pip install -r requirements.txt
```
## Weights

| Base Model | ViSpec on Hugging Face |
| --- | --- |
| llava-hf/llava-v1.6-vicuna-7b-hf | JLKang/ViSpec-llava-v1.6-vicuna-7b-hf |
| llava-hf/llava-v1.6-vicuna-13b-hf | JLKang/ViSpec-llava-v1.6-vicuna-13b-hf |
| Qwen/Qwen2.5-VL-3B-Instruct | JLKang/ViSpec-Qwen2.5-VL-3B-Instruct |
| Qwen/Qwen2.5-VL-7B-Instruct | JLKang/ViSpec-Qwen2.5-VL-7B-Instruct |
| llava-hf/llava-1.5-7b-hf | JLKang/ViSpec-llava-1.5-7b-hf |
## Usage
The workflow consists of three main stages: training data generation, model training, and evaluation.
We provide several pre-trained model checkpoints on Hugging Face (see the Weights section above). If you wish to use these, you can download them and skip the data generation and training sections, proceeding directly to Stage 3: Evaluation.
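For example, a released checkpoint can be fetched with the Hugging Face CLI. This is a sketch: the local directory name is an arbitrary choice, not something the code requires.

```shell
# Install the Hugging Face CLI, then download one ViSpec checkpoint.
# The --local-dir value is an example path of your choosing.
pip install -U "huggingface_hub[cli]"
huggingface-cli download JLKang/ViSpec-Qwen2.5-VL-3B-Instruct \
    --local-dir checkpoints/ViSpec-Qwen2.5-VL-3B-Instruct
```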
### 1. Training Data Generation
This process involves generating two distinct datasets for the two training stages.
#### 1.1. Generating Text-Only Data for Initial Training
First, generate the text-only data. This dataset will be used in the first stage of model training.
```bash
python -m vispec.ge_data.allocation_{llava,qwen}_shargpt \
    --outdir=<path_to_text_data_folder> \
    --start=0 \
    --end=67999 \
    --model={Qwen/Qwen2.5-VL-3B-Instruct,Qwen/Qwen2.5-VL-7B-Instruct,llava-hf/llava-v1.6-vicuna-7b-hf,llava-hf/llava-v1.6-vicuna-13b-hf}
```
#### 1.2. Generating Multimodal Data for ViSpec Training
Next, generate the multimodal data. This dataset is used in the second stage to train the ViSpec module.
```bash
python -m vispec.ge_data.allocation_{llava,qwen}_pretrain_gen \
    --outdir=<path_to_multimodal_data_folder> \
    --start=0 \
    --end=67999 \
    --model={Qwen/Qwen2.5-VL-3B-Instruct,Qwen/Qwen2.5-VL-7B-Instruct,llava-hf/llava-v1.6-vicuna-7b-hf,llava-hf/llava-v1.6-vicuna-13b-hf}
```
**Parameters:**

- `--outdir`: The directory where the generated data will be stored.
- `--start`/`--end`: The index range of the data to be generated.
- `--model`: The base vision-language model to use for data generation.
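Because `--start`/`--end` bound the index range, data generation can be sharded across GPUs by splitting the range. A sketch, assuming the Qwen text-data entry point resolves to `allocation_qwen_shargpt` and that `data/text` is an output directory of your choosing:

```shell
# Hypothetical two-GPU sharding of the 0..67999 index range.
CUDA_VISIBLE_DEVICES=0 python -m vispec.ge_data.allocation_qwen_shargpt \
    --outdir=data/text --start=0 --end=33999 \
    --model=Qwen/Qwen2.5-VL-3B-Instruct &
CUDA_VISIBLE_DEVICES=1 python -m vispec.ge_data.allocation_qwen_shargpt \
    --outdir=data/text --start=34000 --end=67999 \
    --model=Qwen/Qwen2.5-VL-3B-Instruct &
wait  # block until both shards finish
```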
### 2. Model Training
Model training is a two-stage process.
#### 2.1. Initial Training
This stage performs initial training on the draft model using the text-only data generated in Step 1.1.
```bash
accelerate launch --multi_gpu \
    -m --mixed_precision=bf16 \
    vispec.train.main \
    --cpdir=<path_to_output_checkpoints_folder> \
    --basepath={Qwen/Qwen2.5-VL-3B-Instruct,Qwen/Qwen2.5-VL-7B-Instruct,llava-hf/llava-v1.6-vicuna-7b-hf,llava-hf/llava-v1.6-vicuna-13b-hf} \
    --begin-epoch=0 \
    --bs=1 \
    --configpath=vispec/train/<yourconfig.json> \
    --lr=3e-5 \
    --max-len=4096 \
    --num-workers=8 \
    --tmpdir=<path_to_text_data_folder>
```
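On a single GPU, the same entry point can be launched without `--multi_gpu`. This is an illustrative sketch only: `out/stage1` and `data/text` are example paths, and the config filename is hypothetical.

```shell
# Single-GPU variant of Stage 2.1 (hypothetical paths and config name).
accelerate launch --mixed_precision=bf16 \
    -m vispec.train.main \
    --cpdir=out/stage1 \
    --basepath=Qwen/Qwen2.5-VL-3B-Instruct \
    --begin-epoch=0 \
    --bs=1 \
    --configpath=vispec/train/qwen_3B_config.json \
    --lr=3e-5 \
    --max-len=4096 \
    --num-workers=8 \
    --tmpdir=data/text
```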
#### 2.2. Training with ViSpec
This stage continues the training using our proposed ViSpec method and the multimodal data from Step 1.2. It loads the checkpoint from Stage 2.1 to initialize the model weights.
```bash
accelerate launch --multi_gpu \
    -m --mixed_precision=bf16 \
    vispec.train.main_mtp \
    --cpdir=<path_to_output_checkpoints_folder> \
    --basepath={Qwen/Qwen2.5-VL-3B-Instruct,Qwen/Qwen2.5-VL-7B-Instruct,llava-hf/llava-v1.6-vicuna-7b-hf,llava-hf/llava-v1.6-vicuna-13b-hf} \
    --begin-epoch=0 \
    --bs=1 \
    --configpath=vispec/train/<yourconfig.json> \
    --loadpath=<path_to_stage1_checkpoint>/state_20/model.safetensors \
    --lr=3e-6 \
    --max-len=4096 \
    --mtp-steps=1 \
    --num-q=2 \
    --num-workers=8 \
    --tmpdir=<path_to_multimodal_data_folder> \
    --use-ours=True
```
**Key Parameters:**

- `--cpdir`: The output directory for training checkpoints.
- `--tmpdir`: The input directory containing the appropriate training data for each stage.
- `--configpath`: Path to the model configuration file.
- `--loadpath`: Path to load the pre-trained weights from Stage 2.1.
- `--lr`: Learning rate (e.g., `3e-6`).
- `--num-q`: Number of query vectors (e.g., `2`).
- `--use-ours`: Flag to enable ViSpec.
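Note that `--loadpath` points at the `model.safetensors` saved by Stage 2.1. An illustrative Stage 2.2 invocation, with hypothetical paths and config filename:

```shell
# out/stage1/state_20/model.safetensors is the Stage 2.1 checkpoint;
# out/stage2, data/multimodal, and the config name are example choices.
accelerate launch --multi_gpu \
    -m --mixed_precision=bf16 \
    vispec.train.main_mtp \
    --cpdir=out/stage2 \
    --basepath=Qwen/Qwen2.5-VL-3B-Instruct \
    --begin-epoch=0 \
    --bs=1 \
    --configpath=vispec/train/qwen_3B_config.json \
    --loadpath=out/stage1/state_20/model.safetensors \
    --lr=3e-6 \
    --max-len=4096 \
    --mtp-steps=1 \
    --num-q=2 \
    --num-workers=8 \
    --tmpdir=data/multimodal \
    --use-ours=True
```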
### 3. Evaluation

Evaluate the inference speed of the model using both standard autoregressive decoding (the baseline) and speculative decoding.

**Note:** You may safely ignore warnings such as `rotary_emb.inv_freq` being newly initialized.
#### Baseline Speed Evaluation
```bash
python -m vispec.evaluation.gen_baseline_answer_xxx \
    --base-model-path={Qwen/Qwen2.5-VL-3B-Instruct,Qwen/Qwen2.5-VL-7B-Instruct,llava-hf/llava-v1.6-vicuna-7b-hf,llava-hf/llava-v1.6-vicuna-13b-hf} \
    --model-id test \
    --bench-name=<path_to_baseline_results_folder> \
    --spec-model-path=<path_to_your_model_directory> \
    --temperature=<value>
```
Parameters:
--bench-name: The output directory for evaluation results.--spec-model-path: Path to the directory containing the ViSpec model checkpoint. This can be a model you trained or one downloaded from Hugging Face.--temperature: Sampling temperature (e.g.,0.0for greedy,1.0for stochastic).
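A greedy-decoding baseline run might look like the following sketch. The `xxx` suffix stays as-is here (replace it with the script variant matching your model family), and the paths are illustrative examples.

```shell
# results/baseline and checkpoints/vispec-qwen are hypothetical paths;
# substitute the gen_baseline_answer_* variant for your model family.
python -m vispec.evaluation.gen_baseline_answer_xxx \
    --base-model-path=Qwen/Qwen2.5-VL-3B-Instruct \
    --model-id test \
    --bench-name=results/baseline \
    --spec-model-path=checkpoints/vispec-qwen \
    --temperature=0.0
```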
#### Speculative Decoding Speed Evaluation
```bash
python -m vispec.evaluation.gen_spec_answer_xxx \
    --base-model-path={Qwen/Qwen2.5-VL-3B-Instruct,Qwen/Qwen2.5-VL-7B-Instruct,llava-hf/llava-v1.6-vicuna-7b-hf,llava-hf/llava-v1.6-vicuna-13b-hf} \
    --model-id test \
    --bench-name=<path_to_spec_results_folder> \
    --spec-model-path=<path_to_your_model_directory> \
    --num-q=2 \
    --depth=<value> \
    --top-k=<value> \
    --total-token=<value> \
    --use-ours=True \
    --temperature=<value>
```
**Specific Parameters:**

- `--depth`: The depth of the draft token tree.
- `--top-k`: The width of the draft token tree.
- `--total-token`: The number of draft tokens selected from the draft tree to be verified by the target model.
- `--num-q`: Number of query vectors; must be consistent with the training configuration (e.g., `2`).
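A speculative-decoding run could then be launched as in this sketch. The draft-tree sizes (`--depth`, `--top-k`, `--total-token`) are illustrative values to tune for your hardware, not recommendations, and the paths and `xxx` suffix are placeholders as above.

```shell
# Hypothetical paths and illustrative draft-tree sizes; substitute the
# gen_spec_answer_* variant for your model family.
python -m vispec.evaluation.gen_spec_answer_xxx \
    --base-model-path=Qwen/Qwen2.5-VL-3B-Instruct \
    --model-id test \
    --bench-name=results/spec \
    --spec-model-path=checkpoints/vispec-qwen \
    --num-q=2 \
    --depth=5 \
    --top-k=10 \
    --total-token=60 \
    --use-ours=True \
    --temperature=0.0
```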
## Evaluation Results

Speedup ratios and average acceptance lengths $\tau$ for different methods, evaluated on SQA, MM-Vet, TextVQA, MME, COCO Caps, VizWiz, GQA, and SEED-Bench. Speedup ratios are computed based on the average time required to generate each token; the full per-benchmark table of $\tau$ and speedup values is reported in the paper.
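To relate the two evaluation runs, the speedup ratio is the baseline per-token time divided by the speculative per-token time. A minimal arithmetic sketch with made-up latencies:

```shell
# Hypothetical per-token latencies in ms; not measured results.
awk 'BEGIN { t_base=28.4; t_spec=12.1; printf "speedup = %.2fx\n", t_base/t_spec }'
# → speedup = 2.35x
```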
