
VideoLLaMA2

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs


<p align="center"> <img src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/blob/e7bc34e0e9a96d77947a75b54399d9f96ccf209d/assets/logo.png" width="150" style="margin-bottom: 0.2;"/> </p> <h3 align="center"><a href="https://arxiv.org/abs/2406.07476" style="color:#9C276A"> VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs</a></h3> <h5 align="center"> If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏 </h5>

<details open><summary>💡 Some other multimodal-LLM projects from our team may interest you ✨. </summary><p> <!-- may -->

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding <br> Boqiang Zhang<sup>*</sup>, Kehan Li<sup>*</sup>, Zesen Cheng<sup>*</sup>, Zhiqiang Hu<sup>*</sup>, Yuqian Yuan<sup>*</sup>, Guanzheng Chen<sup>*</sup>, Sicong Leng<sup>*</sup>, Yuming Jiang<sup>*</sup>, Hang Zhang<sup>*</sup>, Xin Li<sup>*</sup>, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao <br>

Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding <br> Hang Zhang, Xin Li, Lidong Bing <br>

VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding <br> Sicong Leng<sup>*</sup>, Hang Zhang<sup>*</sup>, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing <br>

The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio <br> Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing <br>

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss <br> Zesen Cheng<sup>*</sup>, Hang Zhang<sup>*</sup>, Kehan Li<sup>*</sup>, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing <br>

</p></details> <div align="center"><video src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/e0e7951c-f392-42ed-afad-b2c7984d3e38" width="800"></video></div>

📰 News

  • [2025.01.21] 🚀🚀 We are excited to officially launch VideoLLaMA3, featuring enhanced performance across image and video benchmarks, along with a variety of easy-to-follow inference cookbooks. Try it out today!
  • [2024.10.22] Release checkpoints of VideoLLaMA2.1-7B-AV. The audio_visual branch code can be seen here: https://github.com/DAMO-NLP-SG/VideoLLaMA2/tree/audio_visual.
  • [2024.10.15] Release checkpoints of VideoLLaMA2.1-7B-16F-Base and VideoLLaMA2.1-7B-16F.
  • [2024.08.14] Release checkpoints of VideoLLaMA2-72B-Base and VideoLLaMA2-72B.
  • [2024.07.30] Release checkpoints of VideoLLaMA2-8x7B-Base and VideoLLaMA2-8x7B.
  • [2024.06.25] 🔥🔥 As of Jun 25, our VideoLLaMA2-7B-16F is the Top-1 ~7B-sized VideoLLM on the MLVU Leaderboard.
  • [2024.06.18] 🔥🔥 As of Jun 18, our VideoLLaMA2-7B-16F is the Top-1 ~7B-sized VideoLLM on the VideoMME Leaderboard.
  • [2024.06.17] 👋👋 Updated the technical report with the latest results and the missing references. If your work is closely related to VideoLLaMA 2 but not mentioned in the paper, feel free to let us know.
  • [2024.06.14] 🔥🔥 Online Demo is available.
  • [2024.06.03] Release training, evaluation, and serving code of VideoLLaMA 2.
<img src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/b9faf24f-bdd2-4728-9385-acea17ea086d" width="800" />

🛠️ Requirements and Installation

Basic Dependencies:

  • Python >= 3.8
  • PyTorch >= 2.2.0
  • CUDA Version >= 11.8
  • transformers == 4.40.0 (for reproducing paper results)
  • tokenizers == 0.19.1
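
The package constraints above can be checked before installing anything else. The following is a minimal sketch, not part of the repository: the helper names (`parse_version`, `satisfies`, `check_environment`) are hypothetical, and it assumes the PyPI distribution names `torch`, `transformers`, and `tokenizers`. It compares only the leading numeric part of each version, so a local build tag such as `+cu118` is ignored. The Python and CUDA constraints are not covered here.

```python
import re
from importlib import metadata


def parse_version(text):
    """Return the leading numeric release of a version string as a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"cannot parse version: {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def satisfies(installed, op, required):
    """Compare two version strings with '>=' or '=='."""
    a, b = parse_version(installed), parse_version(required)
    # Pad to equal length so that 2.2 and 2.2.0 compare as equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    if op == ">=":
        return a >= b
    if op == "==":
        return a == b
    raise ValueError(f"unsupported operator: {op!r}")


# Assumed PyPI distribution names; constraints copied from the dependency list.
REQUIREMENTS = [
    ("torch", ">=", "2.2.0"),
    ("transformers", "==", "4.40.0"),
    ("tokenizers", "==", "0.19.1"),
]


def check_environment(requirements=REQUIREMENTS):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    for name, op, required in requirements:
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed (need {op} {required})")
            continue
        if not satisfies(installed, op, required):
            problems.append(f"{name} {installed} found, need {op} {required}")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["all listed packages satisfy the constraints"]:
        print(line)
```

Running the script prints one line per unmet constraint, which is usually enough to spot a `transformers` version that differs from the one pinned for reproducing the paper results.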

[Online Mode] Install required packages (better for development):

```bash
git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
```

[Offline Mode] Install VideoLLaMA2 as a Python package (better for direct use):

```bash
git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install flash-attn==2.5.8 --no-build-isolation
```
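
After either installation path, a quick smoke check can confirm that the expected modules are visible to the interpreter. This is a sketch under assumptions: the helper `missing_modules` is hypothetical, and the import names `videollama2` (for the package installed with `pip install -e .`) and `flash_attn` (for the flash-attn distribution) are assumed rather than taken from the text above. It only locates the modules and does not import them, so it needs no GPU.

```python
from importlib import util


def missing_modules(names):
    """Return the top-level module names that cannot be located by this interpreter."""
    return [name for name in names if util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust them if the installed packages differ.
    missing = missing_modules(["videollama2", "flash_attn"])
    print("missing:", ", ".join(missing) if missing else "none")
```

In the first (requirements-only) mode the repository is not installed as a package, so `videollama2` would be expected to resolve only when the script runs from the repository root.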

🚀 Main Results

Multi-Choice Video QA & Video Captioning

(Results figure; the image link is truncated in the source.)