# QSVD: Efficient Low-Rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models
This repository provides the official implementation of QSVD, a method for efficient low-rank approximation that unifies Query-Key-Value (QKV) weight compression in low-precision Vision-Language Models (VLMs).
## 🌟 Highlights

- 🧩 **Joint QKV Decomposition:**
  QSVD performs a joint singular value decomposition on the concatenated query–key–value weight matrices $[W_q, W_k, W_v]$, sharing a common down-projection $W_{qkv}^{d}$.
  → Reduces parameters and FLOPs and yields a low-rank KV-cache compared to per-matrix SVD.
- 📊 **Cross-Layer Rank Allocation:**
  QSVD introduces an adaptive strategy that allocates ranks across all layers based on each singular value's contribution to the model loss, enabling fine-grained, gradient-guided truncation.
  → Preserves critical components while truncating redundant ones across all layers.
- 🎯 **Post-Training Quantization for Low-Rank VLMs:**
  QSVD combines dual orthogonal rotations $(H_1, H_2)$ that smooth channel-wise outliers in both activations and latent buffers with an adaptive exponent $\beta$ that rescales singular values to balance channel distributions and reduce quantization error.
  → Jointly suppresses activation variance and outlier amplification, enabling stable low-precision inference with minimal degradation.
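To make the joint decomposition concrete, here is a minimal NumPy sketch of factoring concatenated Q/K/V weights through a shared down-projection. The shapes, the rank `r`, and the choice of fusing sigma into the down-projection are illustrative assumptions, not the repo's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 16  # hidden size and retained rank (illustrative values)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Concatenate along the output dimension and decompose jointly.
W_qkv = np.concatenate([W_q, W_k, W_v], axis=1)            # (d, 3d)
U, S, Vt = np.linalg.svd(W_qkv, full_matrices=False)

# Shared down-projection (sigma fused in) and a single up-projection
# whose columns split back into the Q, K, and V parts.
W_down = U[:, :r] * S[:r]                                  # (d, r)
W_up = Vt[:r]                                              # (r, 3d)

# One rank-r latent per token serves all three projections,
# which is what makes the KV-cache low-rank.
x = rng.standard_normal((5, d))
latent = x @ W_down                                        # (5, r)
q, k, v = np.split(latent @ W_up, 3, axis=1)

# Parameter count: joint SVD stores d*r + r*3d = 4*d*r values,
# versus 3*(d*r + r*d) = 6*d*r for per-matrix rank-r SVD.
```

Note that `q` equals `x` multiplied by the rank-`r` reconstruction of `W_q`, so accuracy hinges on which singular values are kept, motivating the rank-allocation step below.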
## 🔧 Requirements

This implementation utilizes the myllava repository, adapted from the original LLaVA repo. Please follow the steps below to set up the environment:

```bash
git submodule update --init --recursive
conda create -n QSVD python=3.10 -y
conda activate QSVD
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu126
pip install --no-build-isolation -r requirements.txt
```
## 📊 Evaluation
To evaluate QSVD and reproduce our results, follow the steps below.
### 📁 Dataset Preparation
Follow the LLaVA evaluation guide to prepare the following datasets:
- ScienceQA (Train): LLaVA ScienceQA train
- VizWiz (Test)

Update the paths in `eval_*.py` and `data_utils.py` accordingly.
### 🛠 Evaluation Toolkit
We use third_party/VLMEvalKit for evaluation. Please follow its Quickstart for environment setup and usage.
### ▶️ Running Evaluations

We provide pre-computed calibration cache files so you can reproduce the main QSVD results directly, without rerunning whitening-data and gradient collection. All pre-computed cache files are organized under the cache_file directory.
Each cache package includes:
- Activation-aware whitening data for ASVD-style preprocessing, computed from 256 calibration samples.
- Gradient-square expectations of all singular values, estimated on the same dataset.
- Final importance scores used for cross-layer rank allocation in the joint QKV SVD.
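As an illustration of how such importance scores can drive cross-layer rank allocation, here is a greedy sketch. The importance formula used here (the squared singular value times its gradient-square expectation) and all function names are our assumptions for illustration, not the repo's exact code:

```python
import numpy as np

def allocate_ranks(sigmas, grad_sq, budget):
    """Greedy cross-layer rank allocation (illustrative sketch).

    sigmas:  list of per-layer singular-value arrays.
    grad_sq: list of matching E[g^2] arrays (as stored in the cache).
    budget:  total number of singular values to keep across all layers.
    Importance = sigma^2 * E[g^2] is an assumed stand-in formula.
    """
    scored = [
        (layer, (s ** 2) * g)
        for layer, (sv, gv) in enumerate(zip(sigmas, grad_sq))
        for s, g in zip(sv, gv)
    ]
    # Keep the globally most important singular values, whichever layer
    # they come from, rather than a fixed per-layer rank.
    scored.sort(key=lambda t: t[1], reverse=True)
    keep = [0] * len(sigmas)
    for layer, _ in scored[:budget]:
        keep[layer] += 1
    return keep  # retained rank per layer

# A layer whose singular values matter more to the loss receives more rank:
ranks = allocate_ranks(
    sigmas=[np.array([3.0, 2.0, 1.0]), np.array([3.0, 2.0, 1.0])],
    grad_sq=[np.array([1.0, 1.0, 1.0]), np.array([0.01, 0.01, 0.01])],
    budget=3,
)
# ranks == [3, 0]
```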
Currently released model caches:
To reproduce our main results on ScienceQA with the provided cache files, you can directly run the following example for LLaVA-Next 7B:
```bash
export HF_HOME='your_hf_home'
cd path_to_QSVD/fake_quant
conda activate QSVD
bash path_to_QSVD/scripts/fp16_cache_llavanext.sh 0.9
```
This script will automatically:
- Load the pre-computed calibration cache from `cache_file/llava-next-7b`.
- Apply the joint QKV SVD using the stored whitening data, and adaptively truncate singular values based on the pre-computed importance scores and rank budgets.
- Evaluate the compressed model under the default R₁ = 60% / R₂ = 22.5% configuration.
💡 Note:
The argument `0.9` in the command specifies the rank budget ratio used for joint QKV SVD.
After decomposition, the overall preserved rank is approximately 0.9 / 2 = 0.45, i.e., about 45% of the original rank is retained (the division by 2 is a legacy setting in the current script).
This configuration corresponds to R₁ = 60% / R₂ = 22.5% in our paper. Similarly, `0.8` gives an effective ratio of 0.8 / 2 = 0.4, i.e., 40% of the rank is retained, which corresponds to R₁ = 53.33% / R₂ = 20.0%. The mapping from rank ratio to (R₁, R₂) follows directly from the definition of joint QKV SVD in QSVD.
You can adjust this parameter (`0.9`, `0.8`, etc.) to control the effective rank budget, balancing the trade-off between compression ratio and accuracy.
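The legacy factor of 2 between the script argument and the actually retained rank fraction can be written down explicitly; the helper name below is ours, not the repo's:

```python
def effective_retained_rank(rank_ratio: float) -> float:
    """Convert the script's rank-budget argument into the retained
    rank fraction (hypothetical helper; the /2 is the legacy factor)."""
    return rank_ratio / 2.0

# Mappings documented above:
#   0.9 -> 0.45 retained  (R1 = 60%,    R2 = 22.5%)
#   0.8 -> 0.40 retained  (R1 = 53.33%, R2 = 20.0%)
```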
For LLaVA-Next 13B, modify the script scripts/fp16_cache_llavanext.sh:
- Set the cache path to: `cache_file="../cache_file/llava-next-13b"`
- Change the model argument to: `--model llava-hf/llava-v1.6-vicuna-13b-hf`
- Then run the same command:

```bash
bash path_to_QSVD/scripts/fp16_cache_llavanext.sh 0.9
```
For SmolVLM-Instruct 2B, use the corresponding script:
```bash
bash path_to_QSVD/scripts/fp16_cache_smolvlm.sh 1.5
```

Similarly, the argument `1.5` here specifies a rank budget ratio corresponding to an effective 75% retained rank, which maps to R₁ = 100% / R₂ = 50% in our joint-SVD configuration.
## 🔎 More Details

For more usage and custom evaluations, explore the instructions and scripts in fake_quant and scripts. We currently support only SmolVLM, LLaVA-v1.5, and LLaVA-Next models. You can simply run `mainsmolvlm.py`, `mainllava.py`, or `mainllavanext.py` accordingly to reproduce the results in the paper. The most important arguments are:

- `--model`: Model name (or path to the weights)
- `--seed`: Random seed
- `--nsamples`: Number of samples for SVD calibration
- `--rotate`: Whether to rotate the model (apply QuaRot)
- `--tasks`: Tasks for LM-Eval
- `--cal_dataset`: Calibration dataset for GPTQ quantization/SVD calibration (currently supports `ScienceQA_Train`)
- `--eval_dataset`: Evaluation dataset (currently supports `ScienceQA_TEST` and `VizWiz`)
- `--a_bits`: Number of bits for activation quantization
- `--w_bits`: Number of bits for weight quantization
- `--v_bits`: Number of bits for value quantization (deprecated if using SVD)
- `--k_bits`: Number of bits for key quantization (deprecated if using SVD)
- `--w_clip`: Whether to clip the weights
- `--a_clip_ratio`: Clipping ratio for activations
- `--vita_clip_ratio`: Override the clipping ratio for ViT activations
- `--lma_clip_ratio`: Override the clipping ratio for language-model activations
- `--k_clip_ratio`: Clipping ratio for keys (deprecated if using SVD)
- `--v_clip_ratio`: Clipping ratio for values (deprecated if using SVD)
- `--w_asym`: Whether to use asymmetric quantization for weights
- `--a_asym`: Whether to use asymmetric quantization for activations
- `--v_asym`: Whether to use asymmetric quantization for values
- `--k_asym`: Whether to use asymmetric quantization for keys
- `--a_groupsize`: Group size for activation quantization
- `--w_groupsize`: Group size for weight quantization
- `--v_groupsize`: Group size for value quantization
- `--k_groupsize`: Group size for key quantization
- `--svd_mode`: Choose how sigma is fused into the SVD weights
- `--qkv_fuse`: Whether to concatenate QKV for the joint SVD proposed in our paper
- `--calib_method`: SVD whitening method (`abs_max` and `abs_mean` for ASVD-style)
- `--rank_ratio`: 2 × SVD rank ratio (the factor of 2 is a legacy setting)
- `--act_aware`: Whether to use activation-aware SVD
- `--had_rank`: Whether to add a rotation (H₂ in our paper) to the SVD latent activation
- `--svd_lm`: Whether to apply SVD
- `--act_alpha`: Activation-aware SVD hyperparameter of ASVD
- `--vit_module`: Whether to apply quantization in the ViT
- `--grad_info`: Whether to use the cross-layer rank allocation proposed in our paper
- `--beta_then_svd`: Whether to apply SVD after ViT quantization
- `--cache_file`: Path to the pre-computed calibration cache folder
- `--basepath`: Path to the parent folder of myllava (where the ScienceQA and VizWiz datasets are stored)
For example, to run the ScienceQA evaluation of the llava-v1.5-7b model with all weights and activations quantized, run the following command:
```bash
cd QSVD/fake_quant
python mainllava.py --model liuhaotian/llava-v1.5-7b \
    --a_bits 4 \
    --w_bits 4 \
    --cal_dataset ScienceQA_Train \
    --eval_dataset ScienceQA_TEST \
    --w_rtn \
    --w_clip \
    --lma_clip_ratio 0.9 \
    --nsamples 256 \
    --seed 0 \
    --svd_mode "U" \
    --qkv_fuse \
    --calib_method 'abs_mean' \
    --rank_ratio 1.5 \
    --act_aware \
    --had_rank \
    --svd_lm \
    --act_alpha 0.5 \
    --setting "/sqa/online_then_qkvlm_svd/seed0" \
    --rotate \
    --vit_module \
    --vit_online \
    --g
```
