# VDLM
Repo for paper: https://arxiv.org/abs/2404.06479
We observe that current large multimodal models (LMMs) still struggle with seemingly straightforward reasoning tasks that require precise perception of low-level visual details, such as identifying spatial relations or solving simple mazes. In particular, this failure mode persists in question-answering tasks about vector graphics—images composed purely of 2D objects and shapes.

To address this challenge, we propose the Visually Descriptive Language Model (VDLM), a visual reasoning framework that operates over intermediate text-based visual descriptions, namely SVG representations and a learned Primal Visual Description (PVD), which can be directly integrated into existing LLMs and LMMs. We demonstrate that VDLM outperforms state-of-the-art large multimodal models, such as GPT-4V, across various multimodal reasoning tasks involving vector graphics. See our paper for more details.

## 💻 Environment Setup

- Minimum requirements:

  ```bash
  conda env create -f environment.yml
  conda activate vdlm
  ```

- (Optional) For LLaVA inference:

  ```bash
  cd third_party
  git clone https://github.com/haotian-liu/LLaVA.git
  cd LLaVA
  pip install -e .
  ```

- (Optional) For ViperGPT inference:

  ```bash
  cd third_party
  git clone https://github.com/MikeWangWZHL/viper.git
  ```

  Then set up the environment for ViperGPT following its instructions.
## 🚀 Quick Start (Inference Demo)

- Download the pretrained SVG-to-PVD model from [here](https://huggingface.co/mikewang/PVD-160k-Mistral-7b). It is an LLM finetuned from Mistral-7B-v0.1. Make sure it is stored at `data/ckpts/PVD-160k-Mistral-7b`:

  ```bash
  mkdir -p data/ckpts
  cd data/ckpts
  git lfs install
  git clone https://huggingface.co/mikewang/PVD-160k-Mistral-7b
  ```

- Serve the model with vllm:

  ```bash
  CUDA_VISIBLE_DEVICES=0 ./vllm_serve_model.sh
  ```

- A detailed inference demo 🚀 can be found here.
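Once the model is served, you can query it directly over HTTP. The sketch below is a minimal example, not the repo's demo code: it assumes `vllm_serve_model.sh` exposes vLLM's OpenAI-compatible completions endpoint on `localhost:8000`, and the model name and prompt template are assumptions; verify them against the script and the inference demo.

```python
# Minimal sketch: query the served SVG-to-PVD model over HTTP.
# Assumptions (verify against vllm_serve_model.sh and the demo): the server runs
# vLLM's OpenAI-compatible API on localhost:8000, and the prompt template below.
import requests

svg_code = open("example.svg").read()  # hypothetical SVG input file

payload = {
    "model": "PVD-160k-Mistral-7b",  # assumed model name registered by the server
    "prompt": "Describe the following SVG in Primal Visual Description (PVD):\n" + svg_code,  # assumed template
    "max_tokens": 1024,
    "temperature": 0.0,
}
resp = requests.post("http://localhost:8000/v1/completions", json=payload, timeout=120)
print(resp.json()["choices"][0]["text"])  # predicted PVD as JSON-formatted text
```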
## 📊 Downstream Task Evaluation

### Downstream Task Data Download

You can download the data for the downstream tasks from here. Unzip the file and place the `downstream_tasks` folder under `data/datasets/`.

### Run VDLM Perception: Image -> SVG -> PVD (in JSON format)

```bash
bash scripts/perception/eval_perception.sh
```
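The perception step writes PVD predictions as JSON. As a rough illustration (the file path and the field structure below are assumptions, not the repo's exact schema), a prediction can be inspected like this:

```python
# Illustrative sketch only: inspect a PVD prediction from the perception step.
# The file path and field layout are assumptions; check the actual outputs of
# scripts/perception/eval_perception.sh for the real schema.
import json

with open("example_pvd.json") as f:  # hypothetical output file
    pvd = json.load(f)

# PVD describes the image as a list of primitive shapes with attributes
# (e.g. shape type, position, size, color); attributes vary by primitive.
for shape in pvd:
    print(shape)
```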
### Run Reasoning: PVD + question -> answer

- VDLM-mm:
  - GPT-4o:

    ```bash
    bash scripts/reasoning/vdlm_mm_gpt4o_pvd.sh
    ```

  - GPT-4V:

    ```bash
    bash scripts/reasoning/vdlm_mm_gpt4v_pvd.sh
    ```

- VDLM-txt:
  - GPT-4 Chat API without Code Interpreter:

    ```bash
    bash scripts/reasoning/vdlm_txt_gpt4_pvd.sh
    ```

  - GPT-4 Assistant API with Code Interpreter:

    ```bash
    bash scripts/reasoning/vdlm_txt_gpt4_assistant_pvd.sh
    ```

- Image-based Baselines:
  - GPT-4o + Image input:

    ```bash
    bash scripts/reasoning/gpt4o_image.sh
    ```

  - GPT-4V + Image input:

    ```bash
    bash scripts/reasoning/gpt4v_image.sh
    ```

  - LLaVA-v1.5 + Image input:

    ```bash
    # 7b
    bash scripts/reasoning/llava_1.5_7b_image.sh
    # 13b
    bash scripts/reasoning/llava_1.5_13b_image.sh
    ```

  - ViperGPT w/ GPT-4 + Image input:

    ```bash
    bash scripts/reasoning/vipergpt_inference.sh
    ```
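Conceptually, the VDLM-txt scripts above pass the predicted PVD, as plain text, to a text-only LLM together with the question (VDLM-mm additionally provides the original image). Below is a minimal sketch of that idea using the OpenAI Python SDK; the prompt wording, file path, and question are assumptions and differ from the repo's actual prompts.

```python
# Minimal sketch of the VDLM-txt reasoning step (not the repo's exact prompts):
# give the PVD as text context plus the question to a text-only LLM.
import json
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

pvd = json.load(open("example_pvd.json"))  # hypothetical PVD file from the perception step
question = "Is the small circle inside the large rectangle?"  # hypothetical question

prompt = (
    "The image is described by the following list of primitive shapes (PVD):\n"
    + json.dumps(pvd, indent=2)
    + f"\n\nQuestion: {question}\nAnswer concisely."
)
resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```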
## 📂 SVG-to-PVD Model Data

### PVD-160k Dataset

The dataset used for training our SVG-to-PVD model can be downloaded from here. It contains the preprocessed instruction-tuning instances; each line has the following format:
```json
{
    "id": "XXX",
    "conversations": [
        {"role": "system", "content": "XXX"},
        {"role": "user", "content": "XXX"},
        {"role": "assistant", "content": "XXX"}
        // ...
    ]
}
```
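For reference, a minimal sketch for iterating over the JSONL file, assuming one JSON object per line in the format above (the path is an assumption; point it at your local copy):

```python
# Minimal sketch: read PVD-160k instruction-tuning instances, one JSON object per line.
# The path is an assumption; adjust it to wherever you place the downloaded file.
import json

path = "data/datasets/pretraining_data/pvd_160k.jsonl"
with open(path) as f:
    for line in f:
        example = json.loads(line)
        roles = [turn["role"] for turn in example["conversations"]]
        print(example["id"], roles)
        break  # inspect only the first instance
```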
Additionally, the raw PNGs, SVGs, and PVD annotations generated by our data generator can be downloaded from here.

<!-- By default, the dataset is stored in `data/datasets/pretraining_data/pvd_160k.jsonl`. -->

### Generating custom PVD data
`pvd_data_generator/generate_pvd_img_svg.py` provides the procedural data generator we used for generating the 160K Image/SVG/PVD pairs.

Example usage:

```bash
bash pvd_data_generator/gen_dataset_pvd_160K.sh
```

To specify custom configurations, modify the `main()` function in `pvd_data_generator/generate_pvd_img_svg.py`.

Once the SVGs and PVD annotations are generated, use `pvd_data_generator/get_instruction_pair.py` to construct instruction-tuning data instances in vicuna or openai/mistral format. Modify the `#TODO` parts in the script with the generated custom dataset information, then run:

```bash
python pvd_data_generator/get_instruction_pair.py
```
## 📘 SVG-to-PVD Model Training

We finetune a Mistral-7B model on the PVD-160K dataset using Megatron-LLM. We follow https://github.com/xingyaoww/code-act/blob/main/docs/MODEL_TRAINING.md for the preprocessing and postprocessing of the model and data. We train the model on a SLURM cluster with 4 NVIDIA A100 40GB GPUs.

Example usage:

- Clone the code-act repo:

  ```bash
  cd third_party
  git clone https://github.com/xingyaoww/code-act.git
  ```

- Follow the instructions in https://github.com/xingyaoww/code-act/blob/main/docs/MODEL_TRAINING.md#environment-setup for environment setup, model preprocessing, and data conversion.

- Modify the `TODO:` items in `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.slurm` and `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.sh`.

- Copy `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.slurm` into `code-act/scripts/slurm/configs`; copy `scripts/training/finetune_4xA100_4tp_mistral__pvd_3ep.sh` into `code-act/scripts/models/megatron`.

- Run training:

  ```bash
  cd third_party/code-act
  sbatch scripts/slurm/configs/finetune_4xA100_4tp_mistral__pvd_3ep.slurm scripts/models/megatron/finetune_4xA100_4tp_mistral__pvd_3ep.sh
  ```

- Follow https://github.com/xingyaoww/code-act/blob/main/docs/MODEL_TRAINING.md#convert-back-to-huggingface-format to convert the trained model back to Hugging Face format. The converted model can be served with `vllm` for inference.
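Besides the serving script, the converted checkpoint can also be loaded with vLLM's offline Python API. A minimal sketch; the checkpoint path and prompt template are assumptions to adapt to your setup:

```python
# Minimal sketch: run the converted SVG-to-PVD checkpoint with vLLM's offline API.
# The checkpoint path and prompt template are assumptions; adapt them to your setup.
from vllm import LLM, SamplingParams

llm = LLM(model="data/ckpts/PVD-160k-Mistral-7b")  # path to the converted checkpoint
params = SamplingParams(temperature=0.0, max_tokens=1024)

svg_code = open("example.svg").read()  # hypothetical SVG input
outputs = llm.generate(
    ["Describe the following SVG in Primal Visual Description (PVD):\n" + svg_code],
    params,
)
print(outputs[0].outputs[0].text)
```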
## 📚 Citation

```bibtex
@article{wang2024vdlm,
  title={Visually Descriptive Language Model for Vector Graphics Reasoning},
  author={Wang, Zhenhailong and Hsu, Joy and Wang, Xingyao and Huang, Kuan-Hao and Li, Manling and Wu, Jiajun and Ji, Heng},
  journal={arXiv preprint arXiv:2404.06479},
  year={2024}
}
```
## Website License
This website's template is based on the Nerfies website.
<a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/"><img alt="Creative Commons License" style="border-width:0" src="https://i.creativecommons.org/l/by-sa/4.0/88x31.png" /></a><br />The website is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-sa/4.0/">Creative Commons Attribution-ShareAlike 4.0 International License</a>.
