CoLLaVO

[ACL 2024 Findings] Official PyTorch implementation of CoLLaVO: Crayon Large Language and Vision mOdel, which significantly improves zero-shot vision language performance.


<img src="figures/crayon_emoji.png" style="vertical-align: -10px;" :height="50px" width="50px"> CoLLaVO: Crayon Large Language and Vision mOdel [arxiv]

📰 News

CoLLaVO is now available in 🤗Huggingface Space.
CoLLaVO is featured by Huggingface Daily Papers.
A new model, MoAI, is released: [Paper]/[Github]/[Huggingface].

crayon_demo

🎨 In-Progress

[x] Code is public (Only Inference Supported).
[x] Downloading CoLLaVO-7B is available in Huggingface.
[x] Huggingface README.md for simple running
[x] Short running code for an image example is available.
[ ] Uploading GPT-Aided Evaluation

Official PyTorch implementation of Crayon Large Language and Vision mOdel (CoLLaVO), which improves performance on numerous zero-shot vision language tasks. The code builds on two baseline codebases: XDecoder: Generalized Decoding for Pixel, Image, and Language (accepted at CVPR 2023) and InternLM (Technical Paper).

🏝️ Summary

The remarkable success of Large Language Models (LLMs) and instruction tuning drives the evolution of Vision Language Models (VLMs) towards a versatile general-purpose model. Yet, it remains unexplored whether current VLMs genuinely possess quality object-level image understanding capabilities determined from 'what objects are in the image?' or 'which object corresponds to a specified bounding box?'. Our findings reveal that the image understanding capabilities of current VLMs are strongly correlated with their zero-shot performance on vision language (VL) tasks. This suggests that prioritizing basic image understanding is crucial for VLMs to excel at VL tasks. To enhance object-level image understanding, we propose Crayon Large Language and Vision mOdel (<img src="figures/crayon_emoji.png" style="vertical-align: -5px;" :height="20px" width="20px"> CoLLaVO), which incorporates instruction tuning with Crayon Prompt as a new visual prompt tuning scheme based on panoptic color maps. Furthermore, we present a learning strategy of Dual QLoRA to preserve object-level image understanding without forgetting it during visual instruction tuning, thereby achieving a significant leap in numerous VL benchmarks in a zero-shot setting.

🚀 Highlights

<img src="figures/fig_close.png" width="511" height="350">
<figcaption> Figure. Zero-shot performance of CoLLaVO-7B on challenging VL datasets compared with closed-source VLMs: GPT-4V, Gemini-Pro, Qwen-VL-Plus. Note: The scores of MME are rescaled by 1/20 to match the scales with the accuracies of others. </figcaption>

<img src="figures/fig_seed.png" width="532" height="350">
<figcaption> Figure. Demonstrating the efficiency and effectiveness of CoLLaVO compared with those of other VLMs. Note that accuracy is measured on SEED-IMG. </figcaption>

<img src="figures/fig_pope.png" width="661" height="350">
<figcaption> Table. Measuring four metrics: Accuracy, Precision, Recall, F1-score on three types of question answering to evaluate hallucination of vision language models: Adversarial, Random, and Popular in POPE. </figcaption>

📖 Citation

@article{lee2024collavo,
  title={CoLLaVO: Crayon Large Language and Vision mOdel},
  author={Lee, Byung-Kwan and Park, Beomchan and Kim, Chae Won and Ro, Yong Man},
  journal={arXiv preprint arXiv:2402.11248},
  year={2024}
}

<img src="figures/crayon_emoji.png" style="vertical-align: -2px;" :height="20px" width="20px"> Download CoLLaVO-7B (Under Preparation)

|                 |   GQA    | SQA-IMG  | TextVQA  |   POPE   |   MME-P    |   MME-C   | MM-Bench |  MMB-CN  |  MM-Vet  | Q-Bench  |
|-----------------|:--------:|:--------:|:--------:|:--------:|:----------:|:---------:|:--------:|:--------:|:--------:|:--------:|
| BLIP2-13B       |   42.4   |   61.0   |   42.5   |   85.3   |   1293.8   |   290.0   |    -     |    -     |   22.4   |    -     |
| InstructBLIP-7B |   49.5   |   49.2   |   60.5   |   50.1   |     -      |     -     |   36.0   |   23.7   |   25.6   |   56.7   |
| Qwen-VL-Chat-7B |   57.5   |   68.2   |   61.5   |    -     |   1487.5   |   360.7   |   60.6   |   56.7   |    -     |    -     |
| LLaVA1.5-7B     |   62.0   |   66.8   |   58.2   |   85.9   |   1510.7   |   293.8   |   64.3   |   58.3   |   30.5   |   58.7   |
| CoLLaVO-7B      |   61.4   |   80.7   |   64.2   |   87.2   |   1689.7   |   525.0   |   83.0   |   82.1   |   40.3   |   67.6   |

📂 Directory Layout

.
├── asset                           # Required package lists (Important)
├── trainer                         # Training CoLLaVO and initializing optimizer (Not Supported Now)
├── utils                           # Miscellaneous util files (Not important)
├── collavo                         # CoLLaVO architecture & loading collavo (Important)
├── pipeline                        # Evaluating zero-shot vision language tasks (Important)
│
├── datasets                        # Important
│   ├── dataset_mappers             # data parsing including augmentation for loader
│   ├── evaluation                  # measuring evaluation for each dataset 
│   └── registration                # register dataset
│
├── configs                         
│   ├── accel                       # Accelerate Config files (Support DDP)
│   └── collavo_eval.yaml           # Config of evaluating collavo
│
├── modeling                        # Not Important
│   ├── architectures               # training the prototype of collavo (Not Supported Now)
│   ├── utils                       # utils for modeling (Not important)
│   └── BaseModel                   # loading and saving model
│
├── lbk_entry.py                    # main code of control tower (Important)
├── run                             # bash file for running the evaluation (Important)
│
├── install                         # install required packages (Important)
└── README.md

💡 How to Run?

First, run the following lines from the `install` bash file.

conda create -n collavo python=3.9
conda activate collavo
conda clean -a && pip cache purge
conda install pytorch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r assets/requirements/requirements.txt
pip install -r assets/requirements/requirements_custom.txt
pip install flash-attn --no-build-isolation
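
After installation, it can help to confirm that the pinned versions actually landed in the environment. The following is a minimal sketch, not part of the repository: the helper name `check_versions` is hypothetical, and it assumes the conda-installed packages expose their metadata under the names `torch`, `torchvision`, and `torchaudio`.

```python
from importlib import metadata

# Versions pinned by the conda install command above.
EXPECTED = {"torch": "2.0.1", "torchvision": "0.15.2", "torchaudio": "2.0.2"}


def check_versions(expected, lookup=metadata.version):
    """Return {package: (expected, found)} for every mismatched or missing package."""
    problems = {}
    for name, want in expected.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # Builds may carry a local suffix such as "+cu118"; compare the base version only.
        base = found.split("+")[0] if found else None
        if base != want:
            problems[name] = (want, found)
    return problems


if __name__ == "__main__":
    for name, (want, found) in check_versions(EXPECTED).items():
        print(f"{name}: expected {want}, found {found}")
```

An empty result means all three packages match the pinned versions; a `found` value of `None` means the package metadata could not be located.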

In addition, set the following environment variables to point to the dataset path.

export DETECTRON2_DATASETS=/path/to/dataset
export DATASET=/path/to/dataset
export DATASET2=/path/to/dataset
export VLDATASET=/path/to/dataset
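
A quick preflight check can catch an unset or mistyped path before an evaluation run starts. This is a minimal sketch under the assumption that each of the four variables should point to an existing directory; the helper `missing_dataset_vars` is hypothetical and not part of the repository.

```python
import os

# The four variables exported above.
REQUIRED_VARS = ("DETECTRON2_DATASETS", "DATASET", "DATASET2", "VLDATASET")


def missing_dataset_vars(env=None):
    """Return the names of variables that are unset or do not point to a directory."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not os.path.isdir(env.get(name, ""))]


if __name__ == "__main__":
    bad = missing_dataset_vars()
    if bad:
        print("Unset or invalid dataset variables:", ", ".join(bad))
    else:
        print("All dataset variables point to existing directories.")
```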

Download the CoLLaVO-7B model, and then you can run demo.py:

"""
CoLLaVO-7B

Simple Six Steps
"""

# [1] Loading Image
from PIL import Image
from torchvision.transforms import Resize
from torchvision.transforms.functional import pil_to_tensor
image_path = "figures/crayon_image.jpg"
image = Resize(size=(490, 490), antialias=False)(pil_to_tensor(Image.open(image_path)))

# [2] Instruction Prompt
prompt = "Describe this image in detail"

# [3] Loading CoLLaVO
from collavo.load_collavo import prepare_collavo
collavo_model, collavo_processor, seg_model, seg_processor = prepare_collavo(collavo_path='BK-Lee/CoLLaVO-7B', bits=4, dtype='fp16')

# [4] Pre-processing for CoLLaVO
collavo_inputs = collavo_model.demo_process(image=image, 
                                    prompt=prompt, 
                                    processor=collavo_processor,
                                    seg_model=seg_model,
                                    seg_processor=seg_processor,
                                    device='cuda:0')

# [5] Generate
import torch
with torch.inference_mode():
    generate_ids = collavo_model.generate(**collavo_inputs, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=256, use_cache=True)

# [6] Decoding
answer = collavo_processor.batch_decode(generate_ids, skip_special_tokens=True)[0].split('[U')[0]
print(answer)
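
Step [6] keeps only the text before the first `[U` in the decoded string. Presumably this trims a leftover special-token marker that survives decoding, but that reading is an assumption; the snippet below only isolates the string operation. The helper name `truncate_at_marker` and the sample strings are hypothetical.

```python
def truncate_at_marker(decoded: str, marker: str = "[U") -> str:
    """Keep only the text before the first occurrence of marker (same as step [6])."""
    return decoded.split(marker)[0]


# Hypothetical decoded output with a trailing marker.
sample = "A box of crayons on a desk.[UNUSED_TOKEN_145]"
print(truncate_at_marker(sample))
```

If the marker never appears, the string is returned unchanged. Note that an answer that legitimately contains `[U` would also be cut at that point.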

To validate zero-shot performance on numerous datasets, run the bash file `run`.

# CoLLaVO-Experiment
GPU_DEVICE="0,1,2,3,4,5"
length=${#GPU_DEVICE}
n_gpu=$(((length+1)/2))
main_port=10000
test_batch=1

CUDA_VISIBLE_DEVICES=$GPU_DEVICE \
accelerate launch --config_file configs/accel/ddp_accel.yaml \
    --num_processes=$n_gpu \
    --main_process_port=$main_port \
    lbk_entry.py eval \
    --conf_files configs/collavo_eval.yaml \
    --overrides \
    WANDB False \
    DATASETS.TEST mme \
    PIPELINE MMEPipeline \
    MME.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    SCIENCEQA.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    POPE.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MMVET.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    AI2D.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    HALLUSIONBENCH.TEST.BATCH_SIZE_TOTAL $((n_gpu * test_batch)) \
    MATHVIST
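
The script above derives `n_gpu` from the character length of `GPU_DEVICE` as `(length+1)/2`. That arithmetic is correct only when every device ID is a single digit. The following sketch compares it with counting comma-separated entries; both helper names are hypothetical and not part of the repository.

```python
def count_gpus_by_length(devices: str) -> int:
    """Mirror the script's arithmetic: (string length + 1) // 2."""
    return (len(devices) + 1) // 2


def count_gpus_by_split(devices: str) -> int:
    """Count the non-empty comma-separated entries."""
    return len([d for d in devices.split(",") if d.strip()])


for devices in ("0,1,2,3,4,5", "8,9,10,11"):
    print(devices, count_gpus_by_length(devices), count_gpus_by_split(devices))
```

For `"0,1,2,3,4,5"` both methods give 6, so each `BATCH_SIZE_TOTAL` override evaluates to 6 with `test_batch=1`. For `"8,9,10,11"` the length-based formula gives 5 while only 4 devices are listed, so on machines with ten or more GPUs it may be safer to set the process count explicitly.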
