# VDLM
Repo for paper: https://arxiv.org/abs/2404.06479
We observe that current large multimodal models (LMMs) still struggle with seemingly straightforward reasoning tasks that require precise perception of low-level visual details, such as identifying spatial relations or solving simple mazes. In particular, this failure mode persists in question-answering tasks about vector graphics—images composed purely of 2D objects and shapes.

To address this challenge, we propose the Visually Descriptive Language Model (VDLM), a visual reasoning framework that operates over intermediate text-based visual descriptions, namely SVG representations and a learned Primal Visual Description (PVD), which can be directly integrated into existing LLMs and LMMs. We demonstrate that VDLM outperforms state-of-the-art large multimodal models, such as GPT-4V, across various multimodal reasoning tasks involving vector graphics. See our paper for more details.

## 💻 Environment Setup

- Minimum requirements:

  ```bash
  conda env create -f environment.yml
  conda activate vdlm
  ```

- (Optional) For LLaVA inference:

  ```bash
  cd third_party
  git clone https://github.com/haotian-liu/LLaVA.git
  cd LLaVA
  pip install -e .
  ```

- (Optional) For ViperGPT inference:

  ```bash
  cd third_party
  git clone https://github.com/MikeWangWZHL/viper.git
  ```

  Then set up the environment for ViperGPT following its instructions.
## 🚀 Quick Start (Inference Demo)

- Download the pretrained SVG-to-PVD model from [here](https://huggingface.co/mikewang/PVD-160k-Mistral-7b). It is an LLM finetuned from Mistral-7B-v0.1. Make sure it is stored at `data/ckpts/PVD-160k-Mistral-7b`:

  ```bash
  mkdir -p data/ckpts
  cd data/ckpts
  git lfs install
  git clone https://huggingface.co/mikewang/PVD-160k-Mistral-7b
  ```

- Serve the model with vllm:

  ```bash
  CUDA_VISIBLE_DEVICES=0 ./vllm_serve_model.sh
  ```

- A detailed inference demo 🚀 can be found here.
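Once the model is served, you can query it directly over HTTP. The sketch below is a minimal example, not the repo's demo code: it assumes `vllm_serve_model.sh` exposes vLLM's OpenAI-compatible completions endpoint on `localhost:8000`, and the model name and prompt template are assumptions; verify them against the script and the inference demo.

```python
# Minimal sketch: query the served SVG-to-PVD model over HTTP.
# Assumptions (verify against vllm_serve_model.sh and the demo): the server runs
# vLLM's OpenAI-compatible API on localhost:8000, and the prompt template below.
import requests

svg_code = open("example.svg").read()  # hypothetical SVG input file

payload = {
    "model": "PVD-160k-Mistral-7b",  # assumed model name registered by the server
    "prompt": "Describe the following SVG in Primal Visual Description (PVD):\n" + svg_code,  # assumed template
    "max_tokens": 1024,
    "temperature": 0.0,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["text"])  # predicted PVD as JSON-formatted text
```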
## 📊 Downstream Task Evaluation

### Downstream Task Data Download

You can download the data for the downstream tasks from here. Unzip the file and place the `downstream_tasks` folder under `data/datasets/`.

### Run VDLM Perception: Image -> SVG -> PVD (in JSON format)

```bash
bash scripts/perception/eval_perception.sh
```
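The perception step writes PVD predictions as JSON. As a rough illustration (the file path and the field structure below are assumptions, not the repo's exact schema), a prediction can be inspected like this:

```python
# Illustrative sketch only: inspect a PVD prediction from the perception step.
# The file path and field layout are assumptions; check the actual outputs of
# scripts/perception/eval_perception.sh for the real schema.
import json

with open("example_pvd.json") as f:  # hypothetical output file
    pvd = json.load(f)

# PVD describes the image as a list of primitive shapes with attributes
# (e.g. shape type, position, size, color); attributes vary by primitive.
for shape in pvd:
    print(shape)
```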
### Run Reasoning: PVD + question -> answer

- VDLM-mm:
  - GPT-4o:

    ```bash
    bash scripts/reasoning/vdlm_mm_gpt4o_pvd.sh
    ```

  - GPT-4V:

    ```bash
    bash scripts/reasoning/vdlm_mm_gpt4v_pvd.sh
    ```

- VDLM-txt:
  - GPT-4 Chat API without Code Interpreter:

    ```bash
    bash scripts/reasoning/vdlm_txt_gpt4_pvd.sh
    ```

  - GPT-4 Assistant API with Code Interpreter:

    ```bash
    bash scripts/reasoning/vdlm_txt_gpt4_assistant_pvd.sh
    ```

- Image-based Baselines:
  - GPT-4o + Image input:

    ```bash
    bash scripts/reasoning/gpt4o_image.sh
    ```

  - GPT-4V + Image input:

    ```bash
    bash scripts/reasoning/gpt4v_image.sh
    ```

  - LLaVA-v1.5 + Image input:

    ```bash
    # 7b
    bash scripts/reasoning/llava_1.5_7b_image.sh
    # 13b
    bash scripts/reasoning/llava_1.5_13b_image.sh
    ```

  - ViperGPT w/ GPT-4 + Image input:

    ```bash
    bash scripts/reasoning/vipergpt_inference.sh
    ```
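Conceptually, the VDLM-txt scripts above pass the predicted PVD, as plain text, to a text-only LLM together with the question (VDLM-mm additionally provides the original image). Below is a minimal sketch of that idea using the OpenAI Python SDK; the prompt wording, file path, and question are assumptions and differ from the repo's actual prompts.

```python
# Minimal sketch of the VDLM-txt reasoning step (not the repo's exact prompts):
# give the PVD as text context plus the question to a text-only LLM.
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

pvd = json.load(open("example_pvd.json"))  # hypothetical PVD file from the perception step
question = "Is the small circle inside the large rectangle?"  # hypothetical question

prompt = (
    "The image is described by the following list of primitive shapes (PVD):\n"
    + json.dumps(pvd, indent=2)
    + f"\n\nQuestion: {question}\nAnswer concisely."
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```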
## 📂 SVG-to-PVD Model Data

### PVD-160k Dataset

The dataset used for training our SVG-to-PVD model can be downloaded from here. It contains the preprocessed instruction-tuning instances; each line has the following format:
```json
{
    "id": "XXX",
    "conversations": [
        {"role": "system", "content": "XXX"},
        {"role": "user", "content": "XXX"},
        {"role": "assistant", "content": "XXX"}
        // ...
    ]
}
```
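For reference, a minimal sketch for iterating over the JSONL file, assuming one JSON object per line in the format above (the path is an assumption; point it at your local copy):

```python
# Minimal sketch: read PVD-160k instruction-tuning instances, one JSON object per line.
# The path is an assumption; adjust it to wherever you place the downloaded file.
import json

path = "data/datasets/pretraining_data/pvd_160k.jsonl"
with open(path) as f:
    for line in f:
        example = json.loads(line)
        roles = [turn["role"] for turn in example["conversations"]]
        print(example["id"], roles)
        break  # inspect only the first instance
```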
Additionally, the raw PNGs, SVGs, and PVD annotations generated by our data generator can be downloaded from here.

<!-- By default, the dataset is stored in `data/datasets/pretraining_data/pvd_160k.jsonl`. -->

### Generating custom PVD data
`pvd_data_generator/generate_pvd_img_svg.py` provides the procedural data generator we used for generating the 160K Image/SVG/PVD pairs.

Example usage:

```bash
bash pvd_data_generator/gen_dataset_pvd_160K.sh
```

To specify custom configurations, modify the `main()` function in `pvd_data_generator/generate_pvd_img_svg.py`.

Once the SVGs and PVD annotations are generated, use `pvd_data_generator/get_instruction_pair.py` to construct instruction-tuning data instances in vicuna or openai/mistral format. Modify the `#TODO` parts in the script with the generated custom dataset information, then run:

```bash
python pvd_data_generator/get_instruction_pair.py
```
## 📘 SVG-to-PVD Model Training

We finetune a Mistral-7B model on the PVD-160K dataset using Megatron-LLM. We follow https://github.com/xingyaoww/code-act/blob/main/docs/MODEL_TRAINING.md for the preprocessing and postprocessing of the model and data. We train the model on a SLURM cluster with 4 NVIDIA A100 40GB GPUs.

Example usage:

- Clone the code-act repo:

  ```bash
  cd third_party
  git clone https://github.com/xingyaoww/code-act.git
  ```

- Follow the instructions in https://github.com/xingyaoww/code-act/blob/main/docs/MODEL_TRAINING.md#environment-setup for environment setup, model preprocessing, and data conversion.

- Modify the `TODO:` items in `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.slurm` and `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.sh`.

- Copy `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.slurm` into `code-act/scripts/slurm/configs`; copy `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.sh` into `code-act/scripts/models/megatron`.

- Run training:

  ```bash
  cd third_party/code-act
  sbatch scripts/slurm/configs/finetune_4xA100_4tp_mistral__pvd_3ep.slurm scripts/models/megatron/finetune_4xA100_4tp_mistral__pvd_3ep.sh
  ```

- Follow https://github.com/xingyaoww/code-act/blob/main/docs/MODEL_TRAINING.md#convert-back-to-huggingface-format to convert the trained model back to Hugging Face format. The converted model can be served with `vllm` for inference.
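Besides the serving script, the converted checkpoint can also be loaded with vLLM's offline Python API. A minimal sketch; the checkpoint path and prompt template are assumptions to adapt to your setup:

```python
# Minimal sketch: run the converted SVG-to-PVD checkpoint with vLLM's offline API.
# The checkpoint path and prompt template are assumptions; adapt them to your setup.
from vllm import LLM, SamplingParams

llm = LLM(model="data/ckpts/PVD-160k-Mistral-7b")  # path to the converted checkpoint
params = SamplingParams(temperature=0.0, max_tokens=1024)

svg_code = open("example.svg").read()  # hypothetical SVG input
outputs = llm.generate(
    ["Describe the following SVG in Primal Visual Description (PVD):\n" + svg_code],
    params,
)
print(outputs[0].outputs[0].text)
```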
## 📚 Citation

```bibtex
@article{wang2024vdlm,
  title={Visually Descriptive Language Model for Vector Graphics Reasoning},
  author={Wang, Zhenhailong and Hsu, Joy and Wang, Xingyao and Huang, Kuan-Hao and Li, Manling and Wu, Jiajun and Ji, Heng},
  journal={arXiv preprint arXiv:2404.06479},
  year={2024}
}
```
## Website License
This website's template is based on the Nerfies website.
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />The website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
