VideoLLaMA2
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding <br> Boqiang Zhang<sup>* </sup>, Kehan Li<sup>* </sup>, Zesen Cheng<sup>* </sup>, Zhiqiang Hu<sup>* </sup>, Yuqian Yuan<sup>* </sup>, Guanzheng Chen<sup>* </sup>, Sicong Leng<sup>* </sup>, Yuming Jiang<sup>* </sup>, Hang Zhang<sup>* </sup>, Xin Li<sup>* </sup>, Peng Jin, Wenqi Zhang, Fan Wang, Lidong Bing, Deli Zhao <br>
<br>
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding <br> Hang Zhang, Xin Li, Lidong Bing <br>
<br>
VCD: Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding <br> Sicong Leng<sup>* </sup>, Hang Zhang<sup>* </sup>, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing <br>
<br>
The Curse of Multi-Modalities: Evaluating Hallucinations of Large Multimodal Models across Language, Visual, and Audio <br> Sicong Leng, Yun Xing, Zesen Cheng, Yang Zhou, Hang Zhang, Xin Li, Deli Zhao, Shijian Lu, Chunyan Miao, Lidong Bing <br>
<br>
</p></details>

<div align="center"><video src="https://github.com/DAMO-NLP-SG/VideoLLaMA2/assets/18526640/e0e7951c-f392-42ed-afad-b2c7984d3e38" width="800"></div>

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss <br> Zesen Cheng*, Hang Zhang*, Kehan Li*, Sicong Leng, Zhiqiang Hu, Fei Wu, Deli Zhao, Xin Li, Lidong Bing <br>
<br>
📰 News
- [2025.01.21] 🚀🚀 We are excited to officially launch VideoLLaMA3, featuring enhanced performance across image and video benchmarks, along with a variety of easy-to-follow inference cookbooks. Try it out today!
- [2024.10.22] Release checkpoints of VideoLLaMA2.1-7B-AV. The code is on the audio_visual branch: https://github.com/DAMO-NLP-SG/VideoLLaMA2/tree/audio_visual.
- [2024.10.15] Release checkpoints of VideoLLaMA2.1-7B-16F-Base and VideoLLaMA2.1-7B-16F.
- [2024.08.14] Release checkpoints of VideoLLaMA2-72B-Base and VideoLLaMA2-72B.
- [2024.07.30] Release checkpoints of VideoLLaMA2-8x7B-Base and VideoLLaMA2-8x7B.
- [2024.06.25] 🔥🔥 As of Jun 25, our VideoLLaMA2-7B-16F is the Top-1 ~7B-sized VideoLLM on the MLVU Leaderboard.
- [2024.06.18] 🔥🔥 As of Jun 18, our VideoLLaMA2-7B-16F is the Top-1 ~7B-sized VideoLLM on the VideoMME Leaderboard.
- [2024.06.17] 👋👋 Update the technical report with the latest results and previously missing references. If your work is closely related to VideoLLaMA 2 but is not mentioned in the paper, feel free to let us know.
- [2024.06.14] 🔥🔥 Online Demo is available.
- [2024.06.03] Release training, evaluation, and serving codes of VideoLLaMA 2.
🛠️ Requirements and Installation
Basic Dependencies:
- Python >= 3.8
- PyTorch >= 2.2.0
- CUDA Version >= 11.8
- transformers == 4.40.0 (for reproducing paper results)
- tokenizers == 0.19.1
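The dependency list above can be checked from Python before installing the rest of the project. The following is a minimal sketch, not part of the repository: the helper names (`parse_version`, `installed_versions`, `find_problems`) are hypothetical, the version parser is deliberately simplistic (it keeps only the first three numeric components), and the CUDA requirement is not checked because that needs a working PyTorch install.

```python
import re
import sys
from importlib import metadata

# Versions copied from the dependency list above.
MINIMUMS = {"torch": "2.2.0"}                                # ">=" requirements
PINNED = {"transformers": "4.40.0", "tokenizers": "0.19.1"}  # "==" requirements


def parse_version(text):
    """Turn '2.2.0+cu118' into (2, 2, 0); local and pre-release suffixes are ignored."""
    numbers = re.findall(r"\d+", text.split("+")[0])
    numbers = (numbers + ["0", "0", "0"])[:3]  # pad short versions such as '2.3'
    return tuple(int(n) for n in numbers)


def installed_versions(names):
    """Look up installed distribution versions; missing packages map to None."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def find_problems(installed, python_version=sys.version_info[:3]):
    """Return a list of human-readable mismatches against the listed requirements."""
    problems = []
    if tuple(python_version) < (3, 8):
        problems.append("Python >= 3.8 is required")
    for name, minimum in MINIMUMS.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name} is not installed (need >= {minimum})")
        elif parse_version(have) < parse_version(minimum):
            problems.append(f"{name} {have} is older than {minimum}")
    for name, pinned in PINNED.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name} is not installed (need == {pinned})")
        elif parse_version(have) != parse_version(pinned):
            problems.append(f"{name} {have} differs from pinned {pinned}")
    return problems


if __name__ == "__main__":
    versions = installed_versions(list(MINIMUMS) + list(PINNED))
    for line in find_problems(versions) or ["All checked dependencies look fine."]:
        print(line)
```

Keeping `find_problems` independent of the environment lookup makes it easy to test against a hand-written version mapping, for example one exported from another machine.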
[Online Mode] Install required packages (better for development):
git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install -r requirements.txt
pip install flash-attn==2.5.8 --no-build-isolation
[Offline Mode] Install VideoLLaMA2 as a Python package (better for direct use):
git clone https://github.com/DAMO-NLP-SG/VideoLLaMA2
cd VideoLLaMA2
pip install --upgrade pip # enable PEP 660 support
pip install -e .
pip install flash-attn==2.5.8 --no-build-isolation
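After either install path, a quick sanity check can confirm that the packages are visible to the interpreter. This is a hedged sketch under two assumptions that the text above does not state: that the editable install exposes a top-level module named `videollama2`, and that the model loader accepts the `attn_implementation` values used by Hugging Face Transformers (`"flash_attention_2"`, `"sdpa"`). The helper names are hypothetical.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pick_attention_backend(flash_available):
    """Prefer FlashAttention-2 when flash-attn is installed, else fall back to SDPA."""
    return "flash_attention_2" if flash_available else "sdpa"


def install_report(modules=("videollama2", "flash_attn")):
    """Map each module name (assumed names, see lead-in) to whether it can be found."""
    return {name: is_importable(name) for name in modules}


if __name__ == "__main__":
    report = install_report()
    for name, found in report.items():
        print(f"{name}: {'found' if found else 'missing'}")
    print("suggested attention backend:",
          pick_attention_backend(report.get("flash_attn", False)))
```

Using `find_spec` rather than a plain `import` avoids loading CUDA extensions just to test for their presence, which keeps the check fast and safe to run on a machine without a GPU.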
