
VideoLLaMA3

Frontier Multimodal Foundation Models for Image and Video Understanding

Install / Use

/learn @DAMO-NLP-SG/VideoLLaMA3
README

<p align="center"> <img src="https://github.com/DAMO-NLP-SG/VideoLLaMA3/blob/main/assets/logo.png?raw=true" width="150" style="margin-bottom: 0.2;"/> </p> <h3 align="center"><a href="https://arxiv.org/pdf/2501.13106" style="color:#9C276A"> VideoLLaMA 3: Frontier Multimodal Foundation Models for Video Understanding</a></h3> <h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5> <h5 align="center">


</h5> <details open><summary>💡 Some other multimodal-LLM projects from our team may interest you ✨.</summary><p>

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs <br> Zesen Cheng*, Sicong Leng*, Hang Zhang*, Yifei Xin*, Xin Li*, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, Lidong Bing <br>

VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM <br> Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, Jianke Zhu, Lidong Bing <br>

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding <br> Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing <br>

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio <br> Sicong Leng*, Yun Xing*, Zesen Cheng*, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing <br>

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss <br> Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing <br>

</p></details>

📰 News

  • [2025.02.07] 🔥🔥 Release our re-captioned high-quality image-text dataset VL3-Syn7M.
  • [2025.01.26] 🔥🔥 As of Jan 26, VideoLLaMA3-7B is the best 7B-sized model on LVBench leaderboard.
  • [2025.01.24] 🔥🔥 As of Jan 24, VideoLLaMA3-7B is the best 7B-sized model on VideoMME leaderboard.
  • [2025.01.22] 👋👋 Release technical report of VideoLLaMA 3. If you have works closely related to VideoLLaMA 3 but not mentioned in the paper, feel free to let us know.
  • [2025.01.21] Release models and inference code of VideoLLaMA 3.

🌟 Introduction

VideoLLaMA 3 is a series of multimodal foundation models with frontier image and video understanding capabilities.

<img src="assets/performance.png" style="max-width: 100%; height: auto;"> <details> <summary>💡Click here to show detailed performance on video benchmarks</summary> <img src="https://github.com/user-attachments/assets/118e7a56-0c3e-4132-b0b5-f516d0654338" style="max-width: 100%; height: auto;"> <img src="https://github.com/user-attachments/assets/3524cefe-01d3-4031-8620-f85dc38e3d02" style="max-width: 100%; height: auto;"> </details> <details> <summary>💡Click here to show detailed performance on image benchmarks</summary> <img src="assets/results_image_2b.png" style="max-width: 100%; height: auto;"> <img src="assets/results_image_7b.png" style="max-width: 100%; height: auto;"> </details>

🛠️ Requirements and Installation

Basic Dependencies:

  • Python >= 3.10
  • PyTorch >= 2.4.0
  • CUDA Version >= 11.8
  • transformers >= 4.46.3
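Before installing, you can sanity-check an existing environment against these version floors with a short stdlib-only script. This is a hedged convenience sketch, not part of the repository: `parse_version` and `check_requirements` are hypothetical helpers, and the floors below simply mirror the list above.

```python
from importlib.metadata import version, PackageNotFoundError

# Minimum versions from the dependency list above.
MIN_VERSIONS = {"torch": (2, 4, 0), "transformers": (4, 46, 3)}

def parse_version(v: str) -> tuple:
    """Turn a version string into a comparable tuple of leading
    numeric components ("2.4.0+cu118" -> (2, 4, 0))."""
    parts = []
    for p in v.split("+")[0].split("."):
        digits = ""
        for ch in p:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)

def check_requirements(minimums=MIN_VERSIONS) -> dict:
    """Return {package: bool} for each minimum-version requirement."""
    results = {}
    for pkg, floor in minimums.items():
        try:
            results[pkg] = parse_version(version(pkg)) >= floor
        except PackageNotFoundError:
            results[pkg] = False  # not installed at all
    return results
```

Running `check_requirements()` in a fresh environment reports which packages still need to be installed or upgraded before proceeding.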

Install required packages:

[Inference-only]

For stable inference, install the following package versions:

# PyTorch and torchvision for CUDA 11.8
pip install torch==2.4.0 torchvision==0.19.0 --extra-index-url https://download.pytorch.org/whl/cu118

# Flash-attn pinned to a compatible version
pip install flash-attn==2.7.3 --no-build-isolation --upgrade

# Transformers and accelerate
pip install transformers==4.46.3 accelerate==1.0.1

# Video processing dependencies
pip install decord ffmpeg-python imageio opencv-python

Note: For CUDA 11.8 with torch==2.4.0 and torchvision==0.19.0, use flash-attn==2.7.3.
If you are using a different Python or CUDA version, check the flash-attn releases for a compatible wheel; installing incompatible versions can break the environment.
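If no compatible flash-attn wheel exists for your stack, one defensive option (a sketch, assuming a transformers version that supports the "sdpa" implementation, not an official recommendation from this repo) is to fall back to PyTorch's built-in SDPA attention at load time:

```python
def pick_attn_implementation() -> str:
    """Return "flash_attention_2" when the flash_attn package imports
    cleanly; otherwise fall back to PyTorch's native SDPA kernels,
    which transformers also accepts via attn_implementation."""
    try:
        import flash_attn  # noqa: F401 -- raises ImportError if missing or broken
        return "flash_attention_2"
    except ImportError:
        return "sdpa"

# Usage: from_pretrained(..., attn_implementation=pick_attn_implementation())
```

SDPA is slower than FlashAttention-2 for long video inputs but avoids any build-compatibility issues.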

[Training]

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA3
cd VideoLLaMA3
pip install -r requirements.txt
pip install flash-attn --no-build-isolation

🌎 Model Zoo

| Model                | Base Model   | HF Link                          |
| -------------------- | ------------ | -------------------------------- |
| VideoLLaMA3-7B       | Qwen2.5-7B   | DAMO-NLP-SG/VideoLLaMA3-7B       |
| VideoLLaMA3-2B       | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B       |
| VideoLLaMA3-7B-Image | Qwen2.5-7B   | DAMO-NLP-SG/VideoLLaMA3-7B-Image |
| VideoLLaMA3-2B-Image | Qwen2.5-1.5B | DAMO-NLP-SG/VideoLLaMA3-2B-Image |

We also release the fine-tuned vision encoder of VideoLLaMA3-7B for broader use:

| Model                         | Base Model                | HF Link                      |
| ----------------------------- | ------------------------- | ---------------------------- |
| VideoLLaMA3-7B Vision Encoder | siglip-so400m-patch14-384 | DAMO-NLP-SG/VL3-SigLIP-NaViT |

🤖 Inference

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

device = "cuda:0"
model_path = "DAMO-NLP-SG/VideoLLaMA3-7B"
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    trust_remote_code=True,
    device_map={"": device},
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "video", "video": {"video_path": "./assets/cat_and_chicken.mp4", "fps": 1, "max_frames": 180}},
            {"type": "text", "text": "What is the cat doing?"},
        ]
    },
]

inputs = processor(
    conversation=conversation,
    add_system_prompt=True,
    add_generation_prompt=True,
    return_tensors="pt",
)
# Move tensors to the target device and match the model's bfloat16 dtype.
inputs = {k: v.to(device) if isinstance(v, torch.Tensor) else v for k, v in inputs.items()}
if "pixel_values" in inputs:
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
print(response)
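If you query many videos, the conversation structure used above can be factored into a small helper. This wrapper is a convenience sketch, not part of the released API; it only builds the same message list the processor consumes, with the same `fps` and `max_frames` defaults shown in the example.

```python
def build_video_conversation(video_path: str, question: str,
                             fps: int = 1, max_frames: int = 180) -> list:
    """Build a VideoLLaMA3-style conversation: a system turn followed by
    a user turn containing one video item and one text item."""
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "video",
                 "video": {"video_path": video_path, "fps": fps, "max_frames": max_frames}},
                {"type": "text", "text": question},
            ],
        },
    ]

# Example: pass the result straight to the processor.
conversation = build_video_conversation("./assets/cat_and_chicken.mp4", "What is the cat doing?")
```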
