
VoRA: Integrating Visual Capabilities into LLMs

<div align="center">

Website · arXiv · Hugging Face

<p style="font-size: larger; margin-top: -5px;">
  <a href="https://arxiv.org/pdf/2503.20680">Vision as LoRA</a>
</p>
<div align="center" style="width: 100%; margin: 0 auto;">
  <img src="assets/framework.gif" alt="Framework" width="70%">
</div>
</div>


<h3 align="center">Abstract</h3> <p>We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible contexts, VoRA can process inputs at arbitrary resolutions.</p> <p>To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We demonstrate that, with additional pre-training data, VoRA can perform comparably to conventional encoder-based MLLMs.</p>
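The abstract mentions bi-directional attention over image tokens while text keeps its usual causal attention. The repository's actual masking code is not shown here; the following pure-Python sketch only illustrates the idea, under the assumption that image tokens precede text tokens in the sequence.

```python
# Illustrative sketch (not the repository's implementation): build an
# attention mask where image tokens attend bi-directionally to each
# other, while all other attention stays causal (past + self only).

def build_mixed_mask(n_image, n_text):
    """Return an (n x n) 0/1 mask; mask[i][j] == 1 means token i may attend to j.

    Layout assumed here: image tokens first, then text tokens.
    """
    n = n_image + n_text
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_image and j < n_image:
                mask[i][j] = 1   # image -> image: fully bi-directional
            elif j <= i:
                mask[i][j] = 1   # causal everywhere else
    return mask

mask = build_mixed_mask(n_image=2, n_text=2)
# The first image token can already see the second (mask[0][1] == 1),
# which a purely causal mask would forbid.
```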

Install

Clone this repository and install dependencies:

```shell
git clone https://github.com/Hon-Wong/VoRA.git
cd VoRA
pip3 install -e .
```

Data Preparation

We have collected or generated the following datasets for VoRA. The image bytes are also included, so there’s no need to download the images from URLs. The captions were created using a variety of prompts to ensure diversity.

| HF dataset 🤗 | #Samples | Source | Generated by |
| ------------ | -------- | ------ | ------------ |
| VoRA-Recap-8M | 8M | DataComp-1B | Qwen2-VL-72B-Instruct |
| VoRA-Recap-29M | 29M | DataComp-1B | Qwen2-VL-72B-Instruct |
| VoRA-Recap-GLDv2-1.4M | 1.4M | GLDv2 | Qwen2-VL-72B-Instruct |
| VoRA-TextQA-Mixed | 6.3M | Cambrian, LLaVA-ov, Infinity-Instruction, SmolTalk | - |

1. Download the pre-training datasets from HF.

Datasets for the ablation study:

```shell
apt-get install git-lfs
git-lfs install
cd {raw_data_dir}
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-8M
```

Datasets for pre-training:

```shell
apt-get install git-lfs
git-lfs install
cd {raw_data_dir}
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-29M
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-GLDv2-1.4M
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-TextQA-Mixed
```

2. Convert the Parquet shards to JSON.

For the ablation study:

```shell
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-8M --save_dir={data_dir}/VoRA-Recap-8M
```

For pre-training:

```shell
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-29M --save_dir={data_dir}/VoRA-Recap-29M
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-GLDv2-1.4M --save_dir={data_dir}/VoRA-Recap-GLDv2-1.4M
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-TextQA-Mixed --save_dir={data_dir}/VoRA-TextQA-Mixed
```

3. Prepare LLaVA-mixture.

Convert it to VoRA's format:

```json
{
  "id": "00000000",
  "frames": [
      "frames/00000000.jpg"
  ],
  "conversations": [
      {
          "from": "human",
          "value": "Describe this image in detail."
      },
      {
          "from": "gpt",
          "value": "<image>\nThis image is a ..."
      }
  ]
}
```

To use your own data, simply convert it to this same format.
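Before training on custom data, it can help to sanity-check each record against the structure shown above. This `validate_record` helper is hypothetical (not shipped with the repo) and only encodes the fields visible in the example:

```python
# Hypothetical helper (not part of the VoRA repo): check that a record
# matches the conversation format shown in the README example.
import json

def validate_record(record):
    assert isinstance(record["id"], str)
    assert isinstance(record["frames"], list) and record["frames"]
    for turn in record["conversations"]:
        assert turn["from"] in ("human", "gpt")
        assert isinstance(turn["value"], str)
    return True

sample = {
    "id": "00000000",
    "frames": ["frames/00000000.jpg"],
    "conversations": [
        {"from": "human", "value": "Describe this image in detail."},
        {"from": "gpt", "value": "<image>\nThis image is a ..."},
    ],
}
validate_record(sample)
# The record must also survive a JSON round-trip unchanged:
assert json.loads(json.dumps(sample)) == sample
```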

Training

1. Pre-training

Edit the config file configs/pretrain_I30M_T6M.yaml: change the data path and model path to your local ones, and make sure the global batch size is 256.
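The global batch size is the product of the per-device micro-batch, the gradient accumulation steps, and the number of GPUs; the knob names below are illustrative, not the repo's actual YAML keys. A quick way to check a launch configuration:

```python
# Illustrative arithmetic (the variable names here are assumptions, not
# VoRA's actual config keys): the effective global batch size is
#   per_device_batch * grad_accum_steps * world_size
# and must come out to 256 for the provided pre-training config.

def global_batch_size(per_device_batch, grad_accum_steps, world_size):
    return per_device_batch * grad_accum_steps * world_size

# Single node, 8 GPUs:
assert global_batch_size(per_device_batch=8, grad_accum_steps=4, world_size=8) == 256
# Four nodes x 8 GPUs: less accumulation keeps the same global batch:
assert global_batch_size(per_device_batch=8, grad_accum_steps=1, world_size=32) == 256
```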

Train VoRA on a single node with 8 GPUs:

```shell
deepspeed --master_port=20000 train/train.py configs/pretrain_I30M_T6M.yaml
```

Train VoRA on multiple nodes:

```shell
torchrun --nproc_per_node 8 --nnodes 4 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT train/train.py configs/pretrain_I30M_T6M.yaml
```

2. Fine-tuning

Merge LoRA weights into the base model:

```shell
python3 tools/merge_lora.py --config=configs/pretrain_I30M_T6M.yaml --checkpoint={your_checkpoint_dir} --save_dir={your_save_dir}
```

Then set `model.pretrained` in configs/finetune.yaml and run:

```shell
deepspeed --master_port=20000 train/train.py configs/finetune.yaml
```
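The merge step above folds the LoRA update into the base weights, so no extra branch remains at inference. The pure-Python sketch below only illustrates the underlying rule, W_merged = W + (alpha / r) * B @ A, using the standard LoRA scaling convention (assumed here); tools/merge_lora.py operates on real checkpoints.

```python
# Minimal sketch of what "merging LoRA" means mathematically (not the
# repo's merge script): fold the low-rank update into the dense weight.

def merge_lora(W, A, B, alpha, r):
    """W: d_out x d_in, B: d_out x r, A: r x d_in (nested lists)."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight
B = [[1.0], [0.0]]             # d_out x r, rank r = 1
A = [[0.0, 2.0]]               # r x d_in
merged = merge_lora(W, A, B, alpha=1.0, r=1)
# merged == [[1.0, 2.0], [0.0, 1.0]]
```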

Evaluation

The original results in the paper were evaluated using a suite similar to LLaVA. Alternatively, you can use LMMs-Eval to evaluate the model.

1. Install LMMs-Eval:

```shell
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip3 install -e .
```

2. Evaluate

Evaluate the checkpoints in the paper:

```shell
export HF_TOKEN={your_hf_token}
python3 -m accelerate.commands.launch --num_processes=8 --main_process_port=51999 -m lmms_eval --tasks textvqa_val --model vora --model_args pretrained=Hon-Wong/VoRA-7B-Instruct --batch_size 1 --log_samples --output_path ./logs/
```

Evaluate your own model:

```shell
export HF_TOKEN={your_hf_token}
cp generation_files/* {your_model_dir}
python3 -m accelerate.commands.launch --num_processes=8 --main_process_port=51999 -m lmms_eval --tasks textvqa_val --model vora --model_args pretrained={your_model_dir} --batch_size 1 --log_samples --output_path ./logs/
```

Citation

If you find this repository useful, please consider citing and starring it:

```bibtex
@article{wang2025vision,
  title={Vision as LoRA},
  author={Wang, Han and Ye, Yongjie and Li, Bingru and Nie, Yuxiang and Lu, Jinghui and Tang, Jingqun and Wang, Yanjie and Huang, Can},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}
```
