<img src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/blob/e7bc34e0e9a96d77947a75b54399d9f96ccf209d/assets/logo.png" width="150" style="margin-bottom: 0.2;"/>
<h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#9C276A"> VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs</a></h3>
<h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5>

<details open><summary>💡 Some other multimodal-LLM projects from our team may interest you ✨. </summary>

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding <br>
Boqiang Zhang*, Kehan Li*, Zesen Cheng*, Zhiqiang Hu*, Yuqian Yuan*, Guanzheng Chen*, Sicong Leng*, Yuming Jiang*, Hang Zhang*, Xin Li*, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding <br>
Hang Zhang, Xin Li, Lidong Bing

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding <br>
Sicong Leng*, Hang Zhang*, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio <br>
Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss <br>
Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing

</details> <div align="center"><video src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/e0e7951c-f392-42ed-afad-b2c7984d3e38" width="800"></div>

📰 News

[2025.01.21] 🚀🚀 We are excited to officially launch VideoLLaMA3, featuring enhanced performance across image and video benchmarks, along with a variety of easy-to-follow inference cookbooks. Try it out today!
[2024.10.22] Release checkpoints of VideoLLaMA2.1-7B-AV. The audio_visual branch code can be seen here: https://github.com/DAMO-NLP-SG/VideoLLaMA2/tree/audio_visual.
[2024.10.15] Release checkpoints of VideoLLaMA2.1-7B-16F-Base and VideoLLaMA2.1-7B-16F.
[2024.08.14] Release checkpoints of VideoLLaMA2-72B-Base and VideoLLaMA2-72B.
[2024.07.30] Release checkpoints of VideoLLaMA2-8x7B-Base and VideoLLaMA2-8x7B.
[2024.06.25] 🔥🔥 As of Jun 25, our VideoLLaMA2-7B-16F is the Top-1 ~7B-sized VideoLLM on the MLVU Leaderboard.
[2024.06.18] 🔥🔥 As of Jun 18, our VideoLLaMA2-7B-16F is the Top-1 ~7B-sized VideoLLM on the VideoMME Leaderboard.
[2024.06.17] 👋👋 Updated the technical report with the latest results and previously missing references. If your work is closely related to VideoLLaMA 2 but is not mentioned in the paper, feel free to let us know.
[2024.06.14] 🔥🔥 Online Demo is available.
[2024.06.03] Release training, evaluation, and serving code of VideoLLaMA 2.

🛠️ Requirements and Installation

Basic Dependencies:

Python >= 3.8
PyTorch >= 2.2.0
CUDA Version >= 11.8
transformers == 4.40.0 (for reproducing paper results)
tokenizers == 0.19.1
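
The version constraints above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper names (`parse_version`, `satisfies`, `check_environment`) are hypothetical, the distribution names are assumed to be the usual PyPI ones (`torch`, `transformers`, `tokenizers`), and the version comparison is deliberately rough (it ignores pre-release and local suffixes such as `+cu118`). The CUDA version is not checked here.

```python
import sys
from importlib import metadata

# Constraints copied from the dependency list above.
# Keys are assumed PyPI distribution names.
REQUIREMENTS = {
    "torch": (">=", "2.2.0"),
    "transformers": ("==", "4.40.0"),
    "tokenizers": ("==", "0.19.1"),
}


def parse_version(text):
    """Turn '2.2.0+cu118' into (2, 2, 0), dropping non-numeric suffixes."""
    parts = []
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(installed, op, required):
    """Compare two version strings with '>=' or '=='."""
    a, b = parse_version(installed), parse_version(required)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))  # pad so (2, 2) compares equal to (2, 2, 0)
    b += (0,) * (width - len(b))
    return a >= b if op == ">=" else a == b


def check_environment():
    """Return {name: bool} for Python and each listed package."""
    report = {"python": sys.version_info >= (3, 8)}
    for name, (op, required) in REQUIREMENTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = False  # package is not installed at all
            continue
        report[name] = satisfies(installed, op, required)
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or version mismatch'}")
```

A `False` entry for `transformers` or `tokenizers` only matters if you want to reproduce the paper results, since those two versions are pinned for that purpose.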

[Online Mode] Install required packages (better for development):

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation

[Offline Mode] Install VideoLLaMA2 as a Python package (better for direct use):

git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn==2.5.8 --no-build-isolation
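
After either install path, a quick sanity check can confirm that the packages are visible to the interpreter. This is a sketch under assumptions: the module name `videollama2` and its distribution name are assumed (they are not stated above), `flash_attn` is the import name of the `flash-attn` distribution, and `describe_install` is a hypothetical helper. It only looks the modules up and does not import them, so it runs without a GPU.

```python
from importlib import metadata, util


def describe_install(module_name, dist_name):
    """Return (importable, version or None) without importing the module."""
    importable = util.find_spec(module_name) is not None
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # no installed distribution under that name
    return importable, version


if __name__ == "__main__":
    # (module name, distribution name); the first pair is an assumption.
    targets = [("videollama2", "videollama2"), ("flash_attn", "flash-attn")]
    for module_name, dist_name in targets:
        importable, version = describe_install(module_name, dist_name)
        print(f"{module_name}: importable={importable}, version={version}")
```

In the first install path (requirements only), the package itself is not installed, so a module found there would typically come from running inside the cloned repository directory rather than from an installed distribution.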

🚀 Main Results

Multi-Choice Video QA & Video Captioning

<img src="https://github.com/user-attachments/
