VITA
✨✨[NeurIPS 2025] VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
<p align="center"> <img src="./asset/vita_newlog.jpg" width="100%" height="100%"> </p><font size=7><div align='center' > [📖 VITA-1.5 Paper] [🤖 Basic Demo] [🍎 VITA-1.0] [💬 WeChat (微信)]</div></font>
<p align="center"> <img src="./asset/vita_demo.jpg" width="80%" height="80%"> </p>
<font size=7><div align='center' > [📽 VITA-1.5 Demo Show! Here We Go! 🔥] </div></font>
<font size=7><div align='center' > VITA-1.5 supports both English and Chinese.🌟 </div></font>
You can experience our Basic Demo on ModelScope directly. The Real-Time Interactive Demo needs to be configured according to the instructions.
🔥 News
- **2025.01.17** 🌟 ModelScope now supports VITA-1.5! You can try our Basic Demo there!
- **2025.01.06** 🌟 The VLMEvalKit of OpenCompass now supports both the VITA-1.5 and VITA-1.0 models!
- **2025.01.06** 🌟 The technical report of VITA-1.5 has been released!
- **2024.12.20** 🌟 We are excited to introduce VITA-1.5, a more powerful and more real-time version!
- **2024.08.12** 🌟 We are very proud to launch VITA-1.0, the first-ever open-source interactive omni-multimodal LLM! The open-source code is under internal review; we are moving the process forward as quickly as possible, stay tuned!
Contents <!-- omit in toc -->
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
👀 VITA-1.5 Overview
On 2024.08.12, we launched VITA-1.0, the first-ever open-source interactive omni-multimodal LLM. Now (2024.12.20), we bring you a new version, VITA-1.5!
🌟 What’s New in VITA-1.5?
We are excited to present VITA-1.5, which incorporates a series of advancements:
- **Significantly Reduced Interaction Latency.** The end-to-end speech interaction latency has been reduced from about 4 seconds to 1.5 seconds, enabling near-instant interaction and greatly improving the user experience.
- **Enhanced Multimodal Performance.** The average performance on multimodal benchmarks such as MME, MMBench, and MathVista has been significantly increased from 59.8 to 70.8.
- **Improved Speech Processing.** The speech processing capabilities have been refined to a new level, with the ASR WER (Word Error Rate, Test Other) reduced from 18.4 to 7.5. In addition, the independent TTS module of VITA-1.0 has been replaced with an end-to-end TTS module, which accepts the LLM's embeddings as input.
- **Progressive Training Strategy.** With this strategy, adding speech has little effect on the other multimodal (vision-language) capabilities: the average image understanding performance drops only from 71.3 to 70.8.
📈 Experimental Results
- Evaluation on image and video understanding benchmarks.
- VITA-1.5 outperforms professional speech models on ASR benchmarks.
- Adding the audio modality has little effect on image and video understanding capability.
⭐ Training
Requirements and Installation
```bash
git clone https://github.com/VITA-MLLM/VITA
cd VITA
conda create -n vita python=3.10 -y
conda activate vita
pip install --upgrade pip
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
Data Preparation
- An example JSON file of the training data:

```json
[
    ...
    {
        "set": "sharegpt4",
        "id": "000000000164",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\n<audio>\n"
            },
            {
                "from": "gpt",
                "value": "This is a well-organized kitchen with a clean, modern aesthetic. The kitchen features a white countertop against a white wall, creating a bright and airy atmosphere. "
            }
        ],
        "image": "coco/images/train2017/000000000164.jpg",
        "audio": [
            "new_value_dict_0717/output_wavs/f61cf238b7872b4903e1fc15dcb5a50c.wav"
        ]
    },
    ...
]
```

Following the convention of LLaVA, `"from": "gpt"` simply indicates that the turn is the ground truth of the model output.
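Before launching a long training run, it can be worth checking that every sample in your JSON follows this schema. The helper below is an illustrative sketch, not part of the VITA codebase; the field names come from the example above.

```python
# Sanity-check one training sample in the format shown above.
# Assumption: every <image>/<audio> placeholder in the human turn
# must be backed by an attached "image"/"audio" entry.
sample = {
    "set": "sharegpt4",
    "id": "000000000164",
    "conversations": [
        {"from": "human", "value": "<image>\n<audio>\n"},
        {"from": "gpt", "value": "This is a well-organized kitchen."},
    ],
    "image": "coco/images/train2017/000000000164.jpg",
    "audio": ["new_value_dict_0717/output_wavs/f61cf238b7872b4903e1fc15dcb5a50c.wav"],
}

def validate(sample: dict) -> None:
    for key in ("set", "id", "conversations"):
        assert key in sample, f"missing field: {key}"
    roles = [turn["from"] for turn in sample["conversations"]]
    assert roles[0] == "human" and "gpt" in roles, f"unexpected roles: {roles}"
    text = sample["conversations"][0]["value"]
    if "<image>" in text:
        assert "image" in sample, "found <image> placeholder but no image path"
    if "<audio>" in text:
        assert sample.get("audio"), "found <audio> placeholder but no audio list"

validate(sample)
print("sample OK")
```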
- The `set` field is used to retrieve the image or video folder for data loading. You should add its key-value pair to the `FolderDict` in `./vita/config/dataset_config.py`:
```python
AudioFolder = ""
FolderDict = {
    #### NaturalCap
    "sharegpt4": "",
}
#### NaturalCap
ShareGPT4V = {"chat_path": ""}
```
- Set the JSON path for `"chat_path"` in the corresponding dictionary in `./vita/config/dataset_config.py`.
- Set the audio folder path for `AudioFolder` in `./vita/config/dataset_config.py`.
- Add the data class to `DataConfig` in `./vita/config/__init__.py`:
```python
from .dataset_config import *

NaturalCap = [ShareGPT4V]

DataConfig = {
    "Pretrain_video": NaturalCap,
}
```
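To see how these pieces fit together, here is a hypothetical sketch (not the actual VITA dataloader) of how a sample's `set` key selects its root folder from `FolderDict` and how the relative `image`/`audio` paths are joined onto the configured roots. The folder values (`/data/sharegpt4`, `/data/audio`) and the wav name are placeholder assumptions.

```python
import os

# Placeholder roots; in the real config these come from dataset_config.py.
AudioFolder = "/data/audio"
FolderDict = {"sharegpt4": "/data/sharegpt4"}

def resolve_paths(sample: dict):
    # "set" picks the media root; relative paths are joined onto it.
    root = FolderDict[sample["set"]]
    image = os.path.join(root, sample["image"]) if "image" in sample else None
    audios = [os.path.join(AudioFolder, a) for a in sample.get("audio", [])]
    return image, audios

image, audios = resolve_paths({
    "set": "sharegpt4",
    "image": "coco/images/train2017/000000000164.jpg",
    "audio": ["output_wavs/example.wav"],  # hypothetical file name
})
print(image)   # on Linux: /data/sharegpt4/coco/images/train2017/000000000164.jpg
```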
Continual Training
- Download the required weights: (1) the VITA-1.5 checkpoint, (2) InternViT-300M-448px, and (3) our pretrained audio encoder from Stage-2 audio-language alignment (refer to Fig. 3 in the paper).
- Replace the paths in `./script/train/finetuneTaskNeg_qwen_nodes.sh`:

```bash
...
--model_name_or_path VITA1.5_ckpt \
...
--vision_tower InternViT-300M-448px \
...
--audio_encoder audio-encoder-Qwen2-7B-1107-weight-base-11wh-tunning \
...
```
- Execute the following commands to start the training process:

```bash
export PYTHONPATH=./
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
OUTPUT_DIR=/mnt/cfs/lhj/videomllm_ckpt/outputs/vita_video_audio
bash script/train/finetuneTaskNeg_qwen_nodes.sh ${OUTPUT_DIR}
```
📐 Inference
Quick Start
- Text query

```bash
CUDA_VISIBLE_DEVICES=2 python video_audio_demo.py \
    --model_path [vita/path] \
    --image_path asset/vita_newlog.jpg \
    --model_type qwen2p5_instruct \
    --conv_mode qwen2p5_instruct \
    --question "Describe this image."
```
- Audio query

```bash
CUDA_VISIBLE_DEVICES=4 python video_audio_demo.py \
    --model_path [vita/path] \
    --image_path asset/vita_newlog.png \
    --model_type qwen2p5_instruct \
    --conv_mode qwen2p5_instruct \
    --audio_path asset/q1.wav
```
- Noisy audio query

```bash
CUDA_VISIBLE_DEVICES=4 python video_audio_demo.py \
    --model_path [vita/path] \
    --image_path asset/vita_newlog.png \
    --model_type qwen2p5_instruct \
    --conv_mode qwen2p5_instruct \
    --audio_path asset/q2.wav
```
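The three invocations above differ only in which query flag they pass. As an illustrative convenience (this helper is not part of the repo), the command line can be assembled programmatically, e.g. when batching many queries:

```python
# Hypothetical helper that assembles a video_audio_demo.py command.
# Flag names mirror the invocations shown above.
def build_demo_cmd(model_path, image_path, question=None, audio_path=None,
                   model_type="qwen2p5_instruct"):
    cmd = ["python", "video_audio_demo.py",
           "--model_path", model_path,
           "--image_path", image_path,
           "--model_type", model_type,
           "--conv_mode", model_type]
    if question is not None:          # text query
        cmd += ["--question", question]
    if audio_path is not None:        # audio (or noisy audio) query
        cmd += ["--audio_path", audio_path]
    return cmd

cmd = build_demo_cmd("[vita/path]", "asset/vita_newlog.png",
                     audio_path="asset/q1.wav")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` with the desired `CUDA_VISIBLE_DEVICES` set in the environment.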
Demo
We have accelerated the model using vLLM. Since VITA has not yet been integrated into vLLM, you need to make some modifications to the vLLM code to adapt it for VITA.
```bash
conda create -n vita_demo python==3.10
conda activate vita_demo
pip install -r web_demo/web_demo_requirements.txt

# Back up the weights into a new directory
cp -rL VITA_ckpt/ demo_VITA_ckpt/
mv demo_VITA_ckpt/config.json demo_VITA_ckpt/origin_config.json

cd ./web_demo/vllm_tools
cp -rf qwen2p5_model_weight_file/* ../../demo_VITA_ckpt/
cp -rf vllm_file/* your_anaconda/envs/vita_demo/lib/python3.10/site-packages/vllm/model_executor/models/
```
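The last `cp` hard-codes an Anaconda path (`your_anaconda/envs/vita_demo/...`). If your environment lives elsewhere, the destination can be looked up from the active interpreter instead; this is a small convenience sketch, not part of the repo:

```python
import os
import sysconfig

# Resolve the site-packages of the currently active environment,
# then point at vLLM's model_executor/models directory (the copy
# target for vllm_file/* in the step above).
site_packages = sysconfig.get_paths()["purelib"]
dest = os.path.join(site_packages, "vllm", "model_executor", "models")
print(dest)
```

Run this inside the activated `vita_demo` environment and use the printed path as the `cp` destination.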
📍 Basic Demo
https://github.com/user-attachments/assets/43edd44a-8c8d-43ea-9d2b-beebe909377a
python -m web_demo.web_ability_demo demo_VITA_ckpt/
📍 Real-Time Interactive Demo
To run the real-time interactive demo, you need to make the following preparations:
- Make sure that you have executed the instructions under the Demo section above (the `cp` steps that copy files out of `vllm_tools`).
- Prepare a VAD (Voice Activity Detection) module. You can download silero_vad.onnx and silero_vad.jit, and place these files in the `./web_demo/wakeup_and_vad/resource/` directory.
- For a better real-time interactive experience, you need
