
VoRA: Integrating Visual Capabilities into LLMs

<div align="center">

Website · arXiv · Hugging Face

<p style="font-size: larger; margin-top: -5px;">
  <a href="https://arxiv.org/pdf/2503.20680">Vision as LoRA</a>
</p>
<div align="center" style="width: 100%; margin: 0 auto;">
  <img src="assets/framework.gif" alt="Framework" width="70%">
</div>
</div>


<h3 align="center">Abstract</h3> <p>We introduce Vision as LoRA (VoRA), a novel paradigm for transforming an LLM into an MLLM. Unlike prevalent MLLM architectures that rely on external vision modules for vision encoding, VoRA internalizes visual capabilities by integrating vision-specific LoRA layers directly into the LLM. This design allows the added parameters to be seamlessly merged into the LLM during inference, eliminating structural complexity and minimizing computational overhead. Moreover, inheriting the LLM's ability to handle flexible contexts, VoRA can process inputs at arbitrary resolutions.</p> <p>To further strengthen VoRA's visual capabilities, we introduce a block-wise distillation method that transfers visual priors from a pre-trained ViT into the LoRA layers, effectively accelerating training by injecting visual knowledge. Additionally, we apply bi-directional attention masks to better capture the context information of an image. We demonstrate that, with additional pre-training data, VoRA can perform comparably to conventional encoder-based MLLMs.</p>
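The abstract mentions bi-directional attention over image tokens while text keeps its usual causal attention. The repository's actual masking code is not shown here; the following pure-Python sketch only illustrates the idea, under the assumption that image tokens precede text tokens in the sequence.

```python
# Illustrative sketch (not the repository's implementation): build an
# attention mask where image tokens attend bi-directionally to each
# other, while all other attention stays causal (past + self only).

def build_mixed_mask(n_image, n_text):
    """Return an (n x n) 0/1 mask; mask[i][j] == 1 means token i may attend to j.

    Layout assumed here: image tokens first, then text tokens.
    """
    n = n_image + n_text
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_image and j < n_image:
                mask[i][j] = 1   # image -> image: fully bi-directional
            elif j <= i:
                mask[i][j] = 1   # causal everywhere else
    return mask

mask = build_mixed_mask(n_image=2, n_text=2)
# The first image token can already see the second (mask[0][1] == 1),
# which a purely causal mask would forbid.
```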

Install

Clone this repository and install dependencies:

```shell
git clone https://github.com/Hon-Wong/VoRA.git
cd VoRA
pip3 install -e .
```

Data Preparation

We have collected or generated the following datasets for VoRA. The image bytes are also included, so there’s no need to download the images from URLs. The captions were created using a variety of prompts to ensure diversity.

| HF dataset 🤗 | #Samples | Source | Generated by |
| ------------ | -------- | ------ | ------------ |
| VoRA-Recap-8M | 8M | DataComp-1B | Qwen2-VL-72B-Instruct |
| VoRA-Recap-29M | 29M | DataComp-1B | Qwen2-VL-72B-Instruct |
| VoRA-Recap-GLDv2-1.4M | 1.4M | GLDv2 | Qwen2-VL-72B-Instruct |
| VoRA-TextQA-Mixed | 6.3M | Cambrian, LLaVA-ov, Infinity-Instruction, SmolTalk | - |

1. Download the pre-training datasets from HF.

Datasets for the ablation study:

```shell
apt-get install git-lfs
git-lfs install
cd {raw_data_dir}
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-8M
```

Datasets for pre-training:

```shell
apt-get install git-lfs
git-lfs install
cd {raw_data_dir}
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-29M
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-GLDv2-1.4M
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-TextQA-Mixed
```

2. Convert the Parquet shards to JSON.

For the ablation study:

```shell
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-8M --save_dir={data_dir}/VoRA-Recap-8M
```

For pre-training:

```shell
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-29M --save_dir={data_dir}/VoRA-Recap-29M
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-GLDv2-1.4M --save_dir={data_dir}/VoRA-Recap-GLDv2-1.4M
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-TextQA-Mixed --save_dir={data_dir}/VoRA-TextQA-Mixed
```

3. Prepare LLaVA-mixture.

Convert it to VoRA's format:

```json
{
  "id": "00000000",
  "frames": [
      "frames/00000000.jpg"
  ],
  "conversations": [
      {
          "from": "human",
          "value": "Describe this image in detail."
      },
      {
          "from": "gpt",
          "value": "<image>\nThis image is a ..."
      }
  ]
}
```

To use your own data, simply convert it to this same format.
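Before training on custom data, it can help to sanity-check each record against the structure shown above. This `validate_record` helper is hypothetical (not shipped with the repo) and only encodes the fields visible in the example:

```python
# Hypothetical helper (not part of the VoRA repo): check that a record
# matches the conversation format shown in the README example.
import json

def validate_record(record):
    assert isinstance(record["id"], str)
    assert isinstance(record["frames"], list) and record["frames"]
    for turn in record["conversations"]:
        assert turn["from"] in ("human", "gpt")
        assert isinstance(turn["value"], str)
    return True

sample = {
    "id": "00000000",
    "frames": ["frames/00000000.jpg"],
    "conversations": [
        {"from": "human", "value": "Describe this image in detail."},
        {"from": "gpt", "value": "<image>\nThis image is a ..."},
    ],
}
validate_record(sample)
# The record must also survive a JSON round-trip unchanged:
assert json.loads(json.dumps(sample)) == sample
```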

Training

1. Pre-training

Edit the config file configs/pretrain_I30M_T6M.yaml: change the data path and model path to your local ones, and make sure the global batch size is 256.
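The global batch size is the product of the per-device micro-batch, the gradient accumulation steps, and the number of GPUs; the knob names below are illustrative, not the repo's actual YAML keys. A quick way to check a launch configuration:

```python
# Illustrative arithmetic (the variable names here are assumptions, not
# VoRA's actual config keys): the effective global batch size is
#   per_device_batch * grad_accum_steps * world_size
# and must come out to 256 for the provided pre-training config.

def global_batch_size(per_device_batch, grad_accum_steps, world_size):
    return per_device_batch * grad_accum_steps * world_size

# Single node, 8 GPUs:
assert global_batch_size(per_device_batch=8, grad_accum_steps=4, world_size=8) == 256
# Four nodes x 8 GPUs: less accumulation keeps the same global batch:
assert global_batch_size(per_device_batch=8, grad_accum_steps=1, world_size=32) == 256
```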

Train VoRA on a single node with 8 GPUs:

```shell
deepspeed --master_port=20000 train/train.py configs/pretrain_I30M_T6M.yaml
```

Train VoRA on multiple nodes:

```shell
torchrun --nproc_per_node 8 --nnodes 4 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT train/train.py configs/pretrain_I30M_T6M.yaml
```

2. Fine-tuning

Merge LoRA weights into the base model:

```shell
python3 tools/merge_lora.py --config=configs/pretrain_I30M_T6M.yaml --checkpoint={your_checkpoint_dir} --save_dir={your_save_dir}
```

Then set `model.pretrained` in configs/finetune.yaml and run:

```shell
deepspeed --master_port=20000 train/train.py configs/finetune.yaml
```
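The merge step above folds the LoRA update into the base weights, so no extra branch remains at inference. The pure-Python sketch below only illustrates the underlying rule, W_merged = W + (alpha / r) * B @ A, using the standard LoRA scaling convention (assumed here); tools/merge_lora.py operates on real checkpoints.

```python
# Minimal sketch of what "merging LoRA" means mathematically (not the
# repo's merge script): fold the low-rank update into the dense weight.

def merge_lora(W, A, B, alpha, r):
    """W: d_out x d_in, B: d_out x r, A: r x d_in (nested lists)."""
    scale = alpha / r
    d_out, d_in = len(W), len(W[0])
    merged = [row[:] for row in W]
    for i in range(d_out):
        for j in range(d_in):
            delta = sum(B[i][k] * A[k][j] for k in range(r))
            merged[i][j] += scale * delta
    return merged

W = [[1.0, 0.0], [0.0, 1.0]]   # base weight
B = [[1.0], [0.0]]             # d_out x r, rank r = 1
A = [[0.0, 2.0]]               # r x d_in
merged = merge_lora(W, A, B, alpha=1.0, r=1)
# merged == [[1.0, 2.0], [0.0, 1.0]]
```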

Evaluation

The original results in the paper were evaluated using a suite similar to LLaVA. Alternatively, you can use LMMs-Eval to evaluate the model.

1. Install LMMs-Eval:

```shell
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip3 install -e .
```

2. Evaluate

Evaluate the checkpoints in the paper:

```shell
export HF_TOKEN={your_hf_token}
python3 -m accelerate.commands.launch --num_processes=8 --main_process_port=51999 -m lmms_eval --tasks textvqa_val --model vora --model_args pretrained=Hon-Wong/VoRA-7B-Instruct --batch_size 1 --log_samples --output_path ./logs/
```

Evaluate your own model:

```shell
export HF_TOKEN={your_hf_token}
cp generation_files/* {your_model_dir}
python3 -m accelerate.commands.launch --num_processes=8 --main_process_port=51999 -m lmms_eval --tasks textvqa_val --model vora --model_args pretrained={your_model_dir} --batch_size 1 --log_samples --output_path ./logs/
```

Citation

If you find this repository useful, please consider citing and starring it:

```bibtex
@article{wang2025vision,
  title={Vision as LoRA},
  author={Wang, Han and Ye, Yongjie and Li, Bingru and Nie, Yuxiang and Lu, Jinghui and Tang, Jingqun and Wang, Yanjie and Huang, Can},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}
```
