✌️ [CVPR 2024] VCoder: Versatile Vision Encoders for Multimodal Large Language Models
Jitesh Jain, Jianwei Yang, Humphrey Shi
[Project Page] [COST Dataset] [arXiv] [pdf] [Video] [BibTeX]
This repo contains the code for our paper VCoder: Versatile Vision Encoders for Multimodal Large Language Models.
<p align="center"> <img src="images/features.svg" width="100%" class="center"/> </p>
<p align="center"> <img src="images/vcoder.svg" width="100%" class="center"/> </p>
News
- [December 29, 2023]: Our demo is now available on HuggingFace Spaces. Thanks to the HF team for their support! 🤗
- [December 21, 2023]: Project Page, Dataset, ArXiv Preprint and GitHub Repo are public! 🚀
- 🎯 VCoder is an adapter for improving MLLMs at object-level perception tasks with the aid of auxiliary perception modalities as control inputs (see the conceptual sketch after this list).
- 🎁 We also release the COST dataset to train and evaluate MLLMs at object-level perception tasks!
- 🥁 VCoder LLaVA-1.5 and VCoder-DS LLaVA-1.5 checkpoints are available on HuggingFace Hub!
- 👨🏻‍💻 [COMING SOON] VCoder (IT) LLaVA-1.5 trained on a mix of instruction-tuning data and the COST dataset!
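Conceptually, the adapter path can be pictured as in the sketch below. This is a simplification based on our reading of the paper; the module names are illustrative, not the repo's actual classes:

```python
import torch
import torch.nn as nn

class PerceptionAdapter(nn.Module):
    """Illustrative sketch: project features of an auxiliary perception
    image (e.g., a segmentation or depth map) into the LLM's embedding
    space, analogous to how LLaVA projects RGB image features."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, control_feats: torch.Tensor) -> torch.Tensor:
        # control_feats: (B, N, vision_dim) patch features from a frozen
        # vision encoder run on the control image
        return self.proj(control_feats)  # (B, N, llm_dim) control tokens

# The resulting control tokens are fed to the (frozen) LLM alongside the
# usual RGB image tokens, giving it explicit object-level evidence.
```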
Installation Instructions
We use Python 3.10 and PyTorch 2.0.1 (CUDA 11.7 build) on Ubuntu 20.04.3 LTS.
- Clone this repository.

```bash
git clone https://github.com/SHI-Labs/VCoder
cd VCoder
```

- Setup conda environment.

```bash
conda create -n vcoder python=3.10 -y
conda activate vcoder
pip install --upgrade pip
conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
pip install -e .
pip install ninja
pip install flash-attn --no-build-isolation
```

- Install additional packages for evaluation.

```bash
python -m spacy download en_core_web_sm
pip install --user -U nltk
```
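These NLP packages support COST evaluation, where object nouns are parsed out of the model's text responses before scoring. A toy illustration of that parsing step (assumed usage; the repo's evaluation scripts are the reference):

```python
import spacy

# Parse object nouns from a model response (toy example; the actual
# evaluation also applies CHAIR-style synonym mapping via nltk).
nlp = spacy.load("en_core_web_sm")
response = "Two people wearing suits stand next to a dog."
nouns = [tok.lemma_ for tok in nlp(response) if tok.pos_ == "NOUN"]
print(nouns)  # e.g. ['people', 'suit', 'dog'] (exact lemmas vary by spaCy version)
```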
Demo
You can use either the CLI or the Gradio interface to interact with VCoder LLaVA-1.5 locally.
Note: You can obtain the segmentation map from the OneFormer Demo and the depth map from DINOv2.
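If you prefer to generate these auxiliary inputs locally, here is a minimal sketch using the HuggingFace transformers library. The checkpoint IDs and post-processing choices are our assumptions, not part of this repo; the official demos above remain the reference:

```python
import torch
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation, pipeline

image = Image.open("vcoder_llava/serve/examples/suits.jpg").convert("RGB")

# Panoptic segmentation map via OneFormer (assumed COCO checkpoint).
processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_coco_swin_large")
model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_coco_swin_large")
inputs = processor(images=image, task_inputs=["panoptic"], return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]  # (height, width)
)[0]["segmentation"]  # (H, W) tensor of segment ids; color-code it before
                      # saving, since the example inputs are rendered maps

# Depth map via a DPT head on a DINOv2 backbone (assumed checkpoint).
depth = pipeline("depth-estimation", model="facebook/dpt-dinov2-base-nyu")(image)["depth"]
depth.save("suits_depth.jpeg")  # grayscale PIL image
```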
Gradio Interface
Run the following command:
```bash
CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.gradio_app --model-path shi-labs/vcoder_ds_llava-v1.5-13b
```
CLI Inference
Run the following command (the --seg-image-file and --depth-image-file flags are optional, but a segmentation input is required whenever a depth input is given; --load-4bit is optional and may be swapped for --load-8bit):

```bash
CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.cli \
    --model-path shi-labs/vcoder_ds_llava-v1.5-13b \
    --image-file "vcoder_llava/serve/examples/suits.jpg" \
    --seg-image-file "vcoder_llava/serve/examples/suits_pan.png" \
    --depth-image-file "vcoder_llava/serve/examples/suits_depth.jpeg" \
    --load-4bit
```
Getting Started
Please see Getting Started with VCoder for training and evaluation commands.
Results
Note that we do not fine-tune any parameters in the original LLaVA-1.5 models, so VCoder's performance on general question-answering benchmarks is the same as LLaVA-1.5's.
Benchmarking on COST
| Model | Semantic CS(↑)/HS(↓) | Instance CS(↑)/HS(↓) | Panoptic CS(↑)/HS(↓) | Depth DS(↓) | Checkpoint |
|---------|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| VCoder LLaVA-1.5-7b | 88.6/10.4 | 71.1/26.9 | 86.0/12.8 | - | HF Hub |
| VCoder LLaVA-1.5-13b | 89.0/10.0 | 73.3/25.0 | 87.2/11.6 | - | HF Hub |
| VCoder-DS LLaVA-1.5-7b | 87.8/11.5 | 69.9/28.5 | 86.8/12.4 | 65.9 | HF Hub |
| VCoder-DS LLaVA-1.5-13b | 88.5/10.9 | 71.7/26.3 | 88.5/10.8 | 63.3 | HF Hub |

Here, CS is the count score (higher is better), HS the hallucination score (lower is better), and DS the depth score (lower is better).
We release the model responses used for benchmarking here.
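For intuition, the COST metrics compare object names and counts extracted from a response against the ground truth. A toy approximation of count/hallucination scoring follows (not the repo's exact implementation, which also handles synonyms and depth ordering):

```python
from collections import Counter

def toy_cost_scores(pred: Counter, gt: Counter) -> tuple[float, float]:
    """Toy count score (CS) and hallucination score (HS) over
    object-name -> count maps parsed from text. Illustrative only."""
    # CS: fraction of ground-truth object mentions the response recovers.
    cs = 100 * sum(min(pred[o], gt[o]) for o in gt) / max(sum(gt.values()), 1)
    # HS: fraction of predicted mentions with no ground-truth support.
    hs = 100 * sum(max(pred[o] - gt[o], 0) for o in pred) / max(sum(pred.values()), 1)
    return cs, hs

print(toy_cost_scores(Counter({"person": 2, "dog": 2}), Counter({"person": 2, "dog": 1})))
# -> (100.0, 25.0): all GT mentions recovered, one of four predictions hallucinated
```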
Citation
If you find VCoder useful in your research, please consider starring ⭐ us on GitHub and citing 📚 our work!
```bibtex
@article{jain2023vcoder,
  title={{VCoder: Versatile Vision Encoders for Multimodal Large Language Models}},
  author={Jitesh Jain and Jianwei Yang and Humphrey Shi},
  journal={arXiv},
  year={2023}
}
```
Acknowledgement
We thank the authors of LLaVA, OneFormer, and DINOv2 for open-sourcing their codebases and checkpoints. We are also grateful to the authors of CHAIR for releasing their synonym word mapping.