# VoRA: Vision as LoRA

A fully open, encoder-free MLLM that integrates visual capabilities into LLMs.

<div align="center"> <p style="font-size: larger; margin-top: -5px;"> <a href="https://arxiv.org/pdf/2503.20680">Vision as LoRA</a> </p> <div align="center" style="width: 100%; margin: 0 auto;"> <img src="assets/framework.gif" alt="Framework" width="70%"> </div> </div>

## News
- 2025-04-16: Training code released.
- 2025-04-06: LMMs-Eval now supports VoRA.
- 2025-04-04: VoRA weights and training data released.
## Install

Clone this repository and install dependencies:

```bash
git clone https://github.com/Hon-Wong/VoRA.git
cd VoRA
pip3 install -e .
```
## Data Preparation

We collected or generated the following datasets for VoRA. Image bytes are embedded in each sample, so there is no need to download images from URLs. Captions were generated with a variety of prompts to ensure diversity.
| HF dataset 🤗 | #Samples | Source | Generated by |
| ------------- | -------- | ------ | ------------ |
| VoRA-Recap-8M | 8M | DataComp-1B | Qwen2-VL-72B-Instruct |
| VoRA-Recap-29M | 29M | DataComp-1B | Qwen2-VL-72B-Instruct |
| VoRA-Recap-GLDv2-1.4M | 1.4M | GLDv2 | Qwen2-VL-72B-Instruct |
| VoRA-TextQA-Mixed | 6.3M | Cambrian, LLaVA-ov, Infinity-Instruction, SmolTalk | - |
- Download the pre-training datasets from HF.

For the ablation study:

```bash
apt-get install git-lfs
git-lfs install
cd {raw_data_dir}
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-8M
```

For pre-training:

```bash
apt-get install git-lfs
git-lfs install
cd {raw_data_dir}
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-29M
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-Recap-GLDv2-1.4M
git clone https://huggingface.co/datasets/Hon-Wong/VoRA-TextQA-Mixed
```
- Convert parquet to JSON.

For the ablation study:

```bash
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-8M --save_dir={data_dir}/VoRA-Recap-8M
```

For pre-training:

```bash
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-29M --save_dir={data_dir}/VoRA-Recap-29M
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-Recap-GLDv2-1.4M --save_dir={data_dir}/VoRA-Recap-GLDv2-1.4M
python3 tools/parquet2json.py --dataset_dir={raw_data_dir}/VoRA-TextQA-Mixed --save_dir={data_dir}/VoRA-TextQA-Mixed
```
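The conversion step is conceptually simple: each parquet row becomes one JSON record on disk. Below is a minimal sketch of that idea — not the actual `tools/parquet2json.py` (which reads parquet files, typically via a library such as pyarrow); the function name and field names here are illustrative:

```python
import json
from pathlib import Path

def rows_to_json(rows, save_dir):
    """Write each row (a dict) as one JSON line in save_dir/data.jsonl."""
    out = Path(save_dir)
    out.mkdir(parents=True, exist_ok=True)
    path = out / "data.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return path

# Illustrative rows; real samples additionally carry embedded image bytes.
rows = [
    {"id": "0", "caption": "a photo of a cat"},
    {"id": "1", "caption": "a city skyline at dusk"},
]
path = rows_to_json(rows, "/tmp/vora_demo")
```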
- Prepare LLaVA-mixture and convert it to VoRA's format:
```json
{
    "id": "00000000",
    "frames": [
        "frames/00000000.jpg"
    ],
    "conversations": [
        {
            "from": "human",
            "value": "Describe this image in detail."
        },
        {
            "from": "gpt",
            "value": "<image>\nThis image is a ..."
        }
    ]
}
```
To use your own data, simply convert it to the same format.
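When converting your own data, a small sanity check can catch malformed samples early. The sketch below is illustrative, not part of the repo; the assumptions that turns alternate human/gpt and that `<image>` tokens match the number of frames are ours, so adapt them to your data:

```python
def validate_sample(sample: dict) -> list:
    """Return a list of problems found in one VoRA-format sample."""
    problems = []
    for key in ("id", "frames", "conversations"):
        if key not in sample:
            problems.append(f"missing key: {key}")
    convs = sample.get("conversations", [])
    # Assumption: turns alternate human -> gpt.
    for i, turn in enumerate(convs):
        expected = "human" if i % 2 == 0 else "gpt"
        if turn.get("from") != expected:
            problems.append(f"turn {i}: expected '{expected}', got {turn.get('from')!r}")
    # Assumption: one <image> token per entry in "frames".
    n_tokens = sum(t.get("value", "").count("<image>") for t in convs)
    if n_tokens != len(sample.get("frames", [])):
        problems.append(f"{n_tokens} <image> tokens vs {len(sample.get('frames', []))} frames")
    return problems

sample = {
    "id": "00000000",
    "frames": ["frames/00000000.jpg"],
    "conversations": [
        {"from": "human", "value": "Describe this image in detail."},
        {"from": "gpt", "value": "<image>\nThis image is a ..."},
    ],
}
problems = validate_sample(sample)  # -> []
```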
## Training

- Pre-training

Edit configs/pretrain_I30M_T6M.yaml: change the data and model paths to your local ones, and make sure the global batch size is 256.

Train VoRA on a single node with 8 GPUs:

```bash
deepspeed --master_port=20000 train/train.py configs/pretrain_I30M_T6M.yaml
```

Train VoRA on multiple nodes:

```bash
torchrun --nproc_per_node 8 --nnodes 4 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT train/train.py configs/pretrain_I30M_T6M.yaml
```
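The global batch size of 256 is the product of per-device batch size, gradient-accumulation steps, and the total number of GPUs, so adjust the per-device settings when you change the node count. A quick check (variable names here are illustrative, not keys from the YAML config):

```python
def global_batch_size(per_device_batch: int, grad_accum_steps: int,
                      num_nodes: int, gpus_per_node: int) -> int:
    """Effective number of samples consumed per optimizer step."""
    return per_device_batch * grad_accum_steps * num_nodes * gpus_per_node

# 4 nodes x 8 GPUs, per-device batch 4, accumulation 2 -> 256
assert global_batch_size(4, 2, 4, 8) == 256
# Single node with 8 GPUs needs larger per-device batch or more accumulation.
assert global_batch_size(8, 4, 1, 8) == 256
```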
- Finetuning

Merge the LoRA weights into the base model:

```bash
python3 tools/merge_lora.py --config=configs/pretrain_I30M_T6M.yaml --checkpoint={your_checkpoint_dir} --save_dir={your_save_dir}
```
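Conceptually, merging folds each LoRA update back into its base weight as W' = W + (alpha / r) · B · A, after which no adapter is needed at inference. A toy sketch of that arithmetic with plain Python lists (illustrative only; the real merge_lora.py operates on the checkpoint's tensors, and the scaling convention may differ):

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Fold a LoRA update into the base weight: W + (alpha / r) * B @ A."""
    delta = matmul(B, A)  # (out, in) low-rank update
    scale = alpha / r
    return [[w + scale * d for w, d in zip(wrow, drow)]
            for wrow, drow in zip(W, delta)]

# Toy 2x2 example with rank r = 1.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # (out, r)
A = [[3.0, 4.0]]     # (r, in)
merged = merge_lora(W, A, B, alpha=2.0, r=1)
# delta = [[3, 4], [6, 8]], scale = 2 -> merged = [[7, 8], [12, 17]]
```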
Then set `model.pretrained` in configs/finetune.yaml and run:

```bash
deepspeed --master_port=20000 train/train.py configs/finetune.yaml
```
## Evaluation

The results in the paper were obtained with an evaluation suite similar to LLaVA's. Alternatively, you can evaluate the model with LMMs-Eval.

- Install LMMs-Eval:

```bash
git clone https://github.com/EvolvingLMMs-Lab/lmms-eval
cd lmms-eval
pip3 install -e .
```

- Evaluate

To evaluate the checkpoints from the paper:

```bash
export HF_TOKEN={your_hf_token}
python3 -m accelerate.commands.launch --num_processes=8 --main_process_port=51999 -m lmms_eval --tasks textvqa_val --model vora --model_args pretrained=Hon-Wong/VoRA-7B-Instruct --batch_size 1 --log_samples --output_path ./logs/
```

To evaluate your own model:

```bash
export HF_TOKEN={your_hf_token}
cp generation_files/* {your_model_dir}
python3 -m accelerate.commands.launch --num_processes=8 --main_process_port=51999 -m lmms_eval --tasks textvqa_val --model vora --model_args pretrained={your_model_dir} --batch_size 1 --log_samples --output_path ./logs/
```
## Citation

If you find this repository useful, please consider citing it and starring the repo:

```bibtex
@article{wang2025vision,
  title={Vision as LoRA},
  author={Wang, Han and Ye, Yongjie and Li, Bingru and Nie, Yuxiang and Lu, Jinghui and Tang, Jingqun and Wang, Yanjie and Huang, Can},
  journal={arXiv preprint arXiv:2503.20680},
  year={2025}
}
```