
✌️ VCoder: Versatile Vision Encoders for Multimodal Large Language Models [CVPR 2024]


Jitesh Jain, Jianwei Yang, Humphrey Shi

[Project Page] [COST Dataset] [arXiv] [pdf] [Video] [BibTeX]

This repo contains the code for our paper VCoder: Versatile Vision Encoders for Multimodal Large Language Models.

<p align="center"> <img src="images/features.svg" width="100%" class="center"/> </p> <p align="center"> <img src="images/vcoder.svg" width="100%" class="center"/> </p>

Contents

  1. Installation Instructions
  2. Demo
  3. Dataset Preparation
  4. Getting Started
  5. Results
  6. Citation

News

  • [December 29, 2023]: Our demo is now available on HuggingFace Spaces. Thanks to the HF team for their support! 🤗
  • [December 21, 2023]: Project Page, Dataset, ArXiv Preprint and GitHub Repo are public! 🚀
    • 🎯 VCoder is an adapter that improves MLLMs on object-level perception tasks with the aid of auxiliary perception modalities (e.g., segmentation and depth maps) as control inputs; a conceptual sketch follows this list.
    • 🎁 We also release the COST dataset to train and evaluate MLLMs at object-level perception tasks!
    • 🥁 VCoder LLaVA-1.5 and VCoder-DS LLaVA-1.5 checkpoints are available on HuggingFace Hub!
    • 👨🏻‍💻 [COMING SOON] VCoder (IT) LLaVA-1.5 trained on a mix of instruction-tuning data and COST dataset!
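
For intuition, here is a minimal conceptual sketch of the adapter idea above; the module and parameter names are our own illustration, not the classes used in this repo. Features of an auxiliary perception map (a segmentation or depth image) are projected into the LLM's embedding space and concatenated with the regular image tokens before being fed to the language model.

    import torch
    import torch.nn as nn

    class PerceptionAdapterSketch(nn.Module):
        """Illustrative only: project auxiliary perception features into the
        LLM embedding space and concatenate them with the image tokens."""

        def __init__(self, vision_dim: int, llm_dim: int):
            super().__init__()
            # A small MLP projector, similar in spirit to LLaVA-style projectors.
            self.proj = nn.Sequential(
                nn.Linear(vision_dim, llm_dim),
                nn.GELU(),
                nn.Linear(llm_dim, llm_dim),
            )

        def forward(self, image_tokens, aux_features):
            # image_tokens: (B, N_img, llm_dim) tokens from the standard image encoder
            # aux_features: (B, N_aux, vision_dim) features of the segmentation/depth map
            aux_tokens = self.proj(aux_features)             # (B, N_aux, llm_dim)
            return torch.cat([aux_tokens, image_tokens], 1)  # tokens handed to the LLM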

Installation Instructions

We use Python 3.10 and PyTorch 2.0.1 (CUDA 11.7 build) on Ubuntu 20.04.3 LTS.

  • Clone this repository.

    git clone https://github.com/SHI-Labs/VCoder
    cd VCoder
    
  • Setup conda environment.

    conda create -n vcoder python=3.10 -y
    conda activate vcoder
    pip install --upgrade pip
    conda install -c "nvidia/label/cuda-11.7.0" cuda-toolkit
    conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.7 -c pytorch -c nvidia
    pip install -e .
    pip install ninja
    pip install flash-attn --no-build-isolation
    
  • Install additional packages for evaluation.

    python -m spacy download en_core_web_sm
    pip install --user -U nltk
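
Once everything is installed, the following quick check (a minimal sketch, nothing repo-specific) confirms that the pinned PyTorch build sees your GPU and that flash-attn imports cleanly:

    import torch
    import flash_attn  # should import without error if the build step above succeeded

    print("torch:", torch.__version__)                   # expected: 2.0.1
    print("CUDA available:", torch.cuda.is_available())  # expected: True on a CUDA 11.7 setup
    if torch.cuda.is_available():
        print("device:", torch.cuda.get_device_name(0))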
    

Demo

You can try VCoder on HuggingFace Spaces, or use either the CLI or the Gradio interface to interact with VCoder LLaVA-1.5 locally.

Note: You can obtain the segmentation map from the OneFormer Demo and the depth map from DINOv2.
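
If you prefer to generate these control inputs programmatically rather than through the hosted demos, the sketch below uses HuggingFace transformers; the checkpoint names and the use of a generic DPT model as a stand-in for DINOv2's depth head are assumptions, and the saved maps are raw outputs rather than the colorized visualizations the demos produce.

    import numpy as np
    import torch
    from PIL import Image
    from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation, pipeline

    image = Image.open("vcoder_llava/serve/examples/suits.jpg")

    # Panoptic segmentation map with OneFormer (checkpoint name is an assumption).
    processor = OneFormerProcessor.from_pretrained("shi-labs/oneformer_coco_swin_large")
    seg_model = OneFormerForUniversalSegmentation.from_pretrained("shi-labs/oneformer_coco_swin_large")
    inputs = processor(images=image, task_inputs=["panoptic"], return_tensors="pt")
    with torch.no_grad():
        outputs = seg_model(**inputs)
    panoptic = processor.post_process_panoptic_segmentation(
        outputs, target_sizes=[image.size[::-1]]
    )[0]["segmentation"]  # raw segment-id map; the demo renders a colorized version
    Image.fromarray(panoptic.numpy().astype(np.uint8)).save("suits_pan.png")

    # Depth map; a generic DPT checkpoint stands in for the DINOv2 depth model here.
    depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
    depth_estimator(image)["depth"].save("suits_depth.jpeg")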

Gradio Interface

Run the following command:

CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.gradio_app --model-path shi-labs/vcoder_ds_llava-v1.5-13b

CLI Inference

Run the following command:

CUDA_VISIBLE_DEVICES=0 python -m vcoder_llava.serve.cli \
    --model-path shi-labs/vcoder_ds_llava-v1.5-13b \
    --image-file "vcoder_llava/serve/examples/suits.jpg" \
    --seg-image-file "vcoder_llava/serve/examples/suits_pan.png" \
    --depth-image-file "vcoder_llava/serve/examples/suits_depth.jpeg" \
    --load-4bit

The --seg-image-file and --depth-image-file arguments are optional; a segmentation input is required whenever a depth input is provided. --load-4bit is also optional and may be replaced with --load-8bit.

Getting Started

Please see Getting Started with VCoder for training and evaluation commands.

Results

Note that we do not finetune any parameters in the original LLaVA-1.5 models, so VCoder's performance on general question-answering benchmarks is the same as that of LLaVA-1.5.

Benchmarking on COST

| Model | Semantic CS(↑)/HS(↓) | Instance CS(↑)/HS(↓) | Panoptic CS(↑)/HS(↓) | Depth DS(↓) | Checkpoint |
|---------|:-------------:|:-------------:|:-------------:|:-------------:|:-------------:|
| VCoder LLaVA-1.5-7b | 88.6/10.4 | 71.1/26.9 | 86.0/12.8 | - | HF Hub |
| VCoder LLaVA-1.5-13b | 89.0/10.0 | 73.3/25.0 | 87.2/11.6 | - | HF Hub |
| VCoder-DS LLaVA-1.5-7b | 87.8/11.5 | 69.9/28.5 | 86.8/12.4 | 65.9 | HF Hub |
| VCoder-DS LLaVA-1.5-13b | 88.5/10.9 | 71.7/26.3 | 88.5/10.8 | 63.3 | HF Hub |

We release the model responses used for benchmarking here.
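
The count score (CS), hallucination score (HS), and depth score (DS) above are defined in the paper; please refer to it for the exact formulas. Purely to illustrate the kind of comparison involved, and not as the official metric implementation, one could compare predicted and ground-truth object counts like this:

    from collections import Counter

    def count_overlap_sketch(pred_counts, gt_counts):
        """Illustrative only, NOT the official CS/HS from the VCoder paper:
        fraction of ground-truth object mentions the model recovered, and
        fraction of predicted mentions that do not appear in the ground truth."""
        correct = sum(min(pred_counts[obj], gt_counts[obj]) for obj in gt_counts)
        total_gt = sum(gt_counts.values())
        total_pred = sum(pred_counts.values())
        recovered = correct / total_gt if total_gt else 0.0
        hallucinated = (total_pred - correct) / total_pred if total_pred else 0.0
        return recovered, hallucinated

    # Example: the model answers "2 people, 1 dog, 1 car" but the image contains 2 people and 1 dog.
    print(count_overlap_sketch(Counter({"person": 2, "dog": 1, "car": 1}),
                               Counter({"person": 2, "dog": 1})))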

Citation

If you found VCoder useful, please consider starring ⭐ us on GitHub and citing 📚 us in your research!

@article{jain2023vcoder,
    title={{VCoder: Versatile Vision Encoders for Multimodal Large Language Models}},
    author={Jitesh Jain and Jianwei Yang and Humphrey Shi},
    journal={arXiv},
    year={2023}
}

Acknowledgement

We thank the authors of LLaVA, OneFormer, and DINOv2 for open-sourcing their codebase and checkpoints. We are also grateful to the authors of CHAIR for releasing their synonym word mapping.
