VideoLLaMA3
Frontier Multimodal Foundation Models for Image and Video Understanding
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs <br>
Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing <br>

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM <br>
Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing <br>

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding <br>
Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing <br>

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio <br>
Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing <br>

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss <br>
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing <br>
📰 News
- [2025.02.07] 🔥🔥 Release our re-captioned high-quality image-text dataset VL3-Syn7M.
- [2025.01.26] 🔥🔥 As of Jan 26, VideoLLaMA3-7B is the best 7B-sized model on LVBench leaderboard.
- [2025.01.24] 🔥🔥 As of Jan 24, VideoLLaMA3-7B is the best 7B-sized model on VideoMME leaderboard.
- [2025.01.22] 👋👋 Release technical report of VideoLLaMA 3. If you have works closely related to VideoLLaMA 3 but not mentioned in the paper, feel free to let us know.
- [2025.01.21] Release models and inference code of VideoLLaMA 3.
🌟 Introduction
VideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capabilities.
<img src="assets/performance.png" style="max-width: 100%; height: auto;"> <details> <summary>💡Click here to show detailed performance on video benchmarks</summary> <img src="https://github.com/user-attachments/assets/118e7a56-0c3e-4132-b0b5-f516d0654338" style="max-width: 100%; height: auto;"> <img src="https://github.com/user-attachments/assets/3524cefe-01d3-4031-8620-f85dc38e3d02" style="max-width: 100%; height: auto;"> </details> <details> <summary>💡Click here to show detailed performance on image benchmarks</summary> <img src="assets/results_image_2b.png" style="max-width: 100%; height: auto;"> <img src="assets/results_image_7b.png" style="max-width: 100%; height: auto;"> </details>

🛠️ Requirements and Installation
Basic Dependencies:
- Python >= 3.10
- PyTorch >= 2.4.0
- CUDA Version >= 11.8
- transformers >= 4.46.3
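As a quick sanity check before installing, the minimum versions above can be verified at runtime. The helper below is an illustrative sketch (not part of the VideoLLaMA3 codebase) that uses only the standard library; the `REQUIRED` mapping is an assumption taken from the list above:

```python
from importlib import metadata

# Minimum versions from the dependency list above (an assumption for this sketch);
# CUDA and flash-attn compatibility are handled separately during install.
REQUIRED = {"torch": "2.4.0", "transformers": "4.46.3"}

def check_versions(required=REQUIRED):
    """Return {package: (installed_version_or_None, minimum_version)}."""
    report = {}
    for pkg, minimum in required.items():
        try:
            report[pkg] = (metadata.version(pkg), minimum)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            report[pkg] = (None, minimum)
    return report

if __name__ == "__main__":
    for pkg, (have, need) in check_versions().items():
        status = have if have is not None else "NOT INSTALLED"
        print(f"{pkg}: installed={status}, required>={need}")
```

Running the script prints one line per package, which makes missing or outdated dependencies easy to spot before attempting inference.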
Install required packages:
[Inference-only]
For stable inference, install the following package versions:
```bash
# PyTorch and torchvision for CUDA 11.8
pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118

# Flash-attn pinned to a compatible version
pip install flash-attn==2.7.3 --no-build-isolation --upgrade

# Transformers and accelerate
pip install transformers==4.46.3 accelerate==1.0.1

# Video processing dependencies
pip install decord ffmpeg-python imageio opencv-python
```
⚠ Note: For CUDA 11.8 with `torch==2.4.0` and `torchvision==0.19.0`, use `flash-attn==2.7.3`. If you are using a different Python or CUDA version, please check the flash-attn releases to select a compatible wheel. Using incompatible versions may break the setup.
[Training]
```bash
git clone https://github.com/DAMO-NLP-SG/VideoLLaMA3
cd VideoLLaMA3
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
:earth_americas: Model Zoo
| Model                | Base Model   | HF Link                          |
| -------------------- | ------------ | -------------------------------- |
| VideoLLaMA3-7B       | Qwen2.5-7B   | DAMO-NLP-SG/VideoLLaMA3-7B       |
| VideoLLaMA3-2B       | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B       |
| VideoLLaMA3-7B-Image | Qwen2.5-7B   | DAMO-NLP-SG/VideoLLaMA3-7B-Image |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B-Image |
We also release the tuned vision encoder of VideoLLaMA3-7B for broader standalone use:
| Model                         | Base Model                | HF Link                      |
| ----------------------------- | ------------------------- | ---------------------------- |
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | DAMO-NLP-SG/VL3-SigLIP-NaViT |
🤖 Inference
```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map={"": device},
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 180}},
            {"type": "text", "text": "What is the cat doing?"},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
# Move tensors to the target device and cast pixel values to bfloat16
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
```
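For single-image inputs, the processor accepts the same conversation format with an `image` entry in place of the `video` entry. The sketch below mirrors the video example above; the `image_path` key and the file path are assumptions to verify against the official model card:

```python
# Sketch of an image-only turn: the "image" payload with an "image_path" key
# mirrors the "video"/"video_path" pattern above (an assumption, not confirmed
# here), and the file path is a hypothetical placeholder.
image_conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"image_path": "./assets/example_image.jpg"}},
            {"type": "text", "text": "Please describe this image."},
        ],
    },
]
```

If the format holds, this list can be passed to `processor(conversation=image_conversation, ...)` exactly as in the video example.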