V2PE
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
The official implementation of the paper "V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding".
<div align="center"> <img src="assets/fig1_hf_00.png" alt="drawing" width="600"/> </div>

<div align="center">[🆕 Blog] [📜 ArXiv Paper] [🤗 HF Models] [📖 HF Datasets]</div>

📖 Summary
The main contributions of this work are as follows:
- We construct mixed datasets for VLMs' long-context training and evaluation by augmenting existing multimodal instruction-tuning datasets, and conduct a thorough investigation into why current VLMs struggle with long-context multimodal inputs, revealing that directly applying an LLM's positional encoding to visual tokens is ineffective.
- We propose Variable Visual Position Encoding (V2PE), a novel positional encoding strategy that employs variable and smaller increments for visual tokens, significantly enhancing VLMs' ability to understand and reason over long multimodal contexts.
- We apply our V2PE method and extend training data on the open-source VLM, InternVL2-2B. The fine-tuned VLM performs exceptionally well on both general multimodal benchmarks and long-context multimodal tasks, with the capacity to handle sequences of up to 1M tokens.
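The core idea above can be illustrated with a short sketch: text tokens advance the position index by 1 as usual, while visual tokens advance it by a smaller, variable increment delta. This is an illustrative toy, not the repo's implementation; the function name and token representation are made up for the example, and in V2PE delta is varied during training.

```python
def v2pe_position_ids(tokens, delta=1 / 16):
    """Assign position indices in the V2PE style (illustrative sketch).

    tokens: list of markers, each either "text" or "visual".
    delta:  positional increment for visual tokens (< 1), which V2PE
            varies during training; text tokens always advance by 1.
    Returns one (possibly fractional) position index per token.
    """
    pos, ids = 0.0, []
    for t in tokens:
        ids.append(pos)
        # visual tokens consume less positional "budget" than text tokens
        pos += delta if t == "visual" else 1.0
    return ids
```

With a smaller delta, a long run of visual tokens occupies far less of the positional range, which is what lets the fine-tuned model fit sequences of up to 1M tokens into its context.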
🛠️ Installation
See INSTALLATION.md
In addition, using this codebase requires the following steps:

- Install other requirements:

  pip install --upgrade pip  # enable PEP 660 support
  pip install -e .
📦 Model Preparation
Our models are built from InternVL2-2B.
Please download the above model weights and place them in the pretrained/ folder.
| model name | type | download | size |
| ----------------------- | ---- | ---------- | :----: |
| InternVL2-2B | VLM | 🤗 HF link | 4.4 GB |
cd pretrained/
# pip install -U huggingface_hub
huggingface-cli download --resume-download --local-dir-use-symlinks False OpenGVLab/InternVL2-2B --local-dir InternVL2-2B
The directory structure is:
pretrained
└── InternVL2-2B/
🔥 Supervised Fine-tuning
Prepare Training Datasets
- Download the training and validation datasets from HuggingFace.
- Organize the data as follows in dataset/:

dataset
├── annotation
│   ├── long_mr_128k/
│   ├── long_mr_256k/
│   ├── long_mr_32k/
│   ├── long_vqa_32k/
│   ├── milebench_16k/
│   └── milebench_nh/
├── image
│   ├── long_mr
│   │   ├── train/
│   │   └── val/
│   ├── long_vqa
│   │   ├── image
│   │   │   ├── deepform
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── docvqa
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── infovqa
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── kleistercharity
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   ├── svqa
│   │   │   │   ├── train/
│   │   │   │   └── val/
│   │   │   └── visualmrc
│   │   │       ├── train/
│   │   │       └── val/
│   │   └── paste
│   │       ├── chartqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── clevr
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── dvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── gqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── ocrvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── okvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── tabfact
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── textcaps
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── textvqa
│   │       │   ├── train/
│   │       │   └── val/
│   │       ├── vizwiz
│   │       │   ├── train/
│   │       │   └── val/
│   │       └── wikitablequestions
│   │           ├── train/
│   │           └── val/
│   └── milebench
│       ├── clevr
│       │   └── train/
│       ├── gpr
│       │   └── train/
│       ├── iedit
│       │   └── train/
│       ├── mmcoqa
│       │   └── train/
│       ├── mmqa
│       │   └── train/
│       ├── nh
│       │   └── train/
│       ├── objintercn
│       │   └── train/
│       ├── ocrvqa
│       │   └── train/
│       ├── percept
│       │   └── train/
│       ├── slidevqa
│       │   └── train/
│       ├── spotdiff
│       │   └── train/
│       ├── sta_charades
│       │   └── train/
│       ├── star
│       │   └── train/
│       ├── tqa
│       │   └── train/
│       └── webqa
│           └── train/
└── val
    ├── long_mr_128k/
    ├── long_mr_1m/
    ├── long_mr_256k/
    ├── long_mr_512k/
    ├── long_vqa_32k/
    ├── long_vqa_40k/
    ├── long_vqa_48k/
    ├── long_vqa_56k/
    └── long_vqa_64k/
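Before launching training, it can help to sanity-check that the top-level layout above is in place. The following is a minimal, hypothetical helper (not part of the repo) that reports which expected top-level directories are missing under dataset/:

```python
from pathlib import Path


def check_dataset_layout(root="dataset"):
    """Return the expected top-level directories missing under `root`.

    Checks only the three top-level entries from the layout above
    (annotation, image, val); an empty list means the layout looks OK.
    """
    expected = ["annotation", "image", "val"]
    root = Path(root)
    return [d for d in expected if not (root / d).is_dir()]
```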
Start Training
We provide Slurm scripts for multi-node, multi-GPU training. Training this model on 32 GPUs takes approximately 48 hours.
# using 32 GPUs
PARTITION='your partition' GPUS=32 sh shell/internlm2_2b/internvl_chat_v2_internlm2_2b_dynamic_res_v2pe_32k.sh
Training using ring-attention
When training on datasets with 256k-token or longer contexts, you may need to use ring attention to limit GPU memory usage. To enable it, set two arguments in the training script:
--chunk_num 8 \
--attn_type 'ring' \
Here, chunk_num specifies the number of chunks each sample is split into; the chunks are distributed across chunk_num GPUs. Setting attn_type to 'ring' enables ring attention during training.
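The chunking step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the repo's ring-attention implementation: it only shows how a token sequence might be split into chunk_num contiguous pieces, one per GPU; the actual ring attention then exchanges key/value blocks between GPUs in a ring.

```python
def split_into_chunks(seq, chunk_num):
    """Split a token sequence into chunk_num contiguous chunks.

    Uses ceiling division so every chunk except possibly the last
    has equal length; chunk i would be assigned to GPU i.
    """
    chunk_len = -(-len(seq) // chunk_num)  # ceil(len(seq) / chunk_num)
    return [seq[i * chunk_len:(i + 1) * chunk_len] for i in range(chunk_num)]
```

For example, a 256k-token sample with chunk_num 8 yields eight 32k-token chunks, so each GPU only materializes attention state for its own slice.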
We provide an example training script that utilizes ring attention at: shell/internlm2_2b/internvl_chat_v2_internlm2_2b_dynamic_res_v2pe_256k.sh. You can run this script with the following command:
# using 32 GPUs
PARTITION='your partition' GPUS=32 sh shell/internlm2_2b/internvl_chat_v2_internlm2_2b_dynamic_res_v2pe_256k.sh
📊 Evaluation
Evaluation results in the paper
General MLLM Benchmarks

Long-Context MLLM Benchmarks

Evaluation results of our released model
After reorganizing the codebase and training the released model, we updated its evaluation results as follows:
General MLLM Benchmarks
| Model | #Param | ChartQA | DocVQA | AI2D | InfoVQA | SQA | POPE | MMMU<sub>val</sub> | MMBench<sub>EN</sub> | SEED<sub>I</sub> | Avg |
|---------------------------|--------|---------|--------|-------|---------|-------|-------|--------------------|---------------------|------------------|-------|
| InternVL2-2B | 2.0B | 71.7 | 86.9 | 74.1 | 58.9 | 94.1 | 85.2 | 36.3 | 73.4 | 70.9 | 72.4 |
| DeepSeek-VL-1.3B | 2.0B | 47.4 | - | 51.5 | - | 68.4 | 85.9 | 33.8 | 66.4 | 66.0 | - |
| Qwen2-VL-2B | 2.0B | 73.5 | 90.1 | 74.7 | 65.5 | - | - | 41.1 | 74.9 | - | - |
| Aquila-VL-2B | 2.2B | 32.0 | 85.0 | 75.1 | 58.3 | 95.1 | 83.1 | 46.9 | 79.0 | 73.9 | 69.8 |
| MiniCPM-V-2 | 2.8B | 55.6 | 71.9 | 62.9 | - | 80.7 | 86.3 | 38.2 | 64.1 | 67.1 | - |
| Vintern-3B-beta | 3.7B | 68.3 | - | 69.1 | - | 75.0 | 87.4 | 46.7 | 70.6 | 70.0 | - |
| Llama 3.2 11B | 11B | 83.4 | 88.4 | 91.1 | - | - | - | 50.7 | 68.0 | - | - |
| Qwen2-VL-72B | 73B | 88.3 | 96.5 | 88.1 | 84.5 | 91.2 | 87.2 | 64.5 | 86.9 | 77.9 | 85.0 |
| GPT-4o | - | 85.7 | 92.8 | 84.7 | - | 90.1 | 97.2 | 69.1 | 82.1 | 76.7 | - |
| InternVL2-V2PE-32K | 2.0B | 76.4 | 83.9 | 73.2 | 55.9 | 94.9 | 88.8 | 36.6 | 73.5 | 71.2 | 72.5 |
Long-Context MLLM Benchmarks
| Model | #Param | MM-NIAH/Image | MM-NIAH/Text | MM-NIAH/Avg | Milebench/T | Milebench/S | Milebench/NI | Milebench/Avg | VideoMME | MVBench |
|--------------------------|--------|---------------|--------------|-------------|--------------|--------------|---------------|--------------|------------|------------|
| InternVL2-2B | 2.0B | 23.0 | 18.9 | 21.0 | 58.2 | 54.5 | 37.0 | 49.9 | - | - |
| Phi-3-Vision | 2.7B | - | - | - | 46.9 | 50.0 | - | - | - | - |
| OmChat | 3.9B | - | - | - | 51.4 | 52.0 | - | - | 45.9 | 50.2 |