PLLaVA

Official repository for the paper PLLaVA

<div align="center"> <h2><a href="https://pllava.github.io/">PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning</a></h2>

Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See-Kiong Ng, Jiashi Feng

</div> <!-- [![Paper](https://img.shields.io/badge/cs.CV-2311.17005-b31b1b?logo=arxiv&logoColor=red)](https://arxiv.org/abs/2311.17005) -->

Project Page: PLLaVA


<div align="center"> <a href="https://pllava.github.io"> <img src="assert/logo.png"> </a> </div> <div align="center"> <video src="https://github.com/magic-research/PLLaVA/assets/55656210/a6619702-12d3-489d-bfcc-0ef7105544b2" width="100%"> </div>

Overview

Welcome to PLLAVA!

The primary purpose of this repository is to support research and the development of prototype models. It is designed to facilitate ease of experimentation and enable a clear overview of results. Please note that this repo is currently undergoing development and reconstruction.

It's important to mention that we have not optimized the response speed of the application or the frontend logic. Our goal is to maintain simplicity, clarity, and ease of development, making it accessible for both researchers and students. If you have suggestions or want to enhance the application's performance, please feel free to contact us or contribute to the project.

We briefly introduce our work in the PLLAVA section; for more details, feel free to read our paper. Check out the Usage section to start using this repo. If you find our work interesting, please star us; your support is all we ask. If you find our work helpful, feel free to cite us directly.

:fire: Updates

  • 2024/4/24: Release:
    • We are releasing our code/models/datasets.

🏖️ PLLAVA

<div align="center"> <a href="https://www.youtube.com/embed/nAEje8tu18U?si=GXxjgP93j77FzDbw"> <img src="assert/teaser.jpg"> </a> </div>

Abstract

Vision-language pre-training (VLP) has significantly elevated performance across a range of vision-language applications. Yet, the pre-training process for video-related tasks demands an exceptionally high degree of computational and data resources. This paper investigates a straightforward, highly efficient, and resource-light approach to adapting an existing image-language pre-training model for video data. Our preliminary experiments reveal that directly fine-tuning pre-trained image-language models with multiple frames on video datasets leads to performance saturation or even a drop in caption-related tasks. Besides, it is also vulnerable to prompts and tends to provide short descriptions. We conducted a deep analysis and observed that the performance saturation and the vulnerability might be related to the dominant patches that exist in some single video patches. We then propose a simple pooling strategy to smooth the feature distribution along the temporal dimension and thus reduce the dominant impacts from some extreme tokens. The new model is termed Pooling LLaVA, or PLLaVA in short. With the proposed pooling strategy, we achieve new state-of-the-art performance on all evaluated datasets. Notably, on the recent popular Video ChatGPT benchmark, PLLaVA achieves a score of 3.48 out of 5 on average of five evaluated dimensions, which is the new state-of-the-art score on the leaderboard and is 0.31 higher than the previous SOTA results from GPT4V (IG-VLM). On the latest multi-choice benchmark MVBench, PLLaVA achieves 58.1% accuracy on average across 20 sub-tasks, which is the new state-of-the-art result and is 14.5% higher than GPT4V (IG-VLM).

<div align="center"><img src="assert/module.png"></div>

SEARCHING FOR OPTIMAL POOLING STRATEGY

There are two dimensions for the pooling strategy: the spatial dimension and the temporal dimension. We empirically found that reducing the spatial dimension with a larger temporal dimension could lead to better model performance, compared to reducing the temporal dimension directly.
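As a toy illustration of this idea, the following plain-Python sketch pools a sequence of per-frame patch features first along the spatial (patch) dimension and then along the temporal (frame) dimension. This is not the repository's actual implementation, which applies adaptive average pooling to multi-channel ViT features in PyTorch; it only mirrors the binning behavior of `torch.nn.AdaptiveAvgPool1d` on scalar features.

```python
def adaptive_avg_pool_1d(values, out_len):
    """Average-pool a 1-D sequence down to out_len bins, mimicking
    torch.nn.AdaptiveAvgPool1d on a single channel."""
    n = len(values)
    pooled = []
    for i in range(out_len):
        start = (i * n) // out_len
        end = -(-((i + 1) * n) // out_len)  # ceiling division
        window = values[start:end]
        pooled.append(sum(window) / len(window))
    return pooled


def pool_video_features(frames, out_frames, out_patches):
    """frames: T frames, each a list of P scalar patch features.
    Pools the patch (spatial) dimension to out_patches per frame, then the
    frame (temporal) dimension to out_frames, smoothing the feature
    distribution along time."""
    spatial = [adaptive_avg_pool_1d(f, out_patches) for f in frames]
    # Pool each patch position across time, then transpose back to
    # (out_frames, out_patches).
    temporal = [adaptive_avg_pool_1d(list(col), out_frames)
                for col in zip(*spatial)]
    return [list(row) for row in zip(*temporal)]
```

In the regime the paper found preferable, `out_frames` stays close to the original frame count while `out_patches` is reduced aggressively, e.g. pooling (16 frames, 576 patches) down to (16, 144) rather than collapsing frames.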

<div align="center"><img src="assert/zeroshot.png"></div>

STATE-OF-THE-ART PERFORMANCE

We compare the performance of PLLAVA with recent popular methods on both question-answering and captioning datasets. The results are shown below.

<div align="center"><img src="assert/performance.png"></div>

:hammer: Usage

This section provides guidance on how to run, train, and evaluate our models.

Install

First, you will need to set up the environment and download some pre-trained weights.

This repo is built on transformers for model construction and accelerate for distributed training. Follow the instructions below to install the required environment.

  1. The environment setup below targets Python 3.10. If you use conda, we recommend creating the virtual environment with:
conda create -n pllava python=3.10
  2. Install PyTorch from the official website. The code runs on torch 2.2.1 with cu118 or cu121; select the build that matches your driver version.
torch                       2.2.1+cu118
torchaudio                  2.2.1+cu118
torchvision                 0.17.1+cu118

If your driver supports CUDA 12.1 or higher, you can likely install everything with:

pip install -r requirements.txt

Otherwise, first install a torch build that matches your server, then install the other packages:

pip install -r requirements.torch.txt    # pick your own torch requirements (this file targets cu118), or install torch directly from the official website
pip install -r requirements.no_torch.txt # install the remaining packages
  3. Prepare the models. We prefer to download the Hugging Face models explicitly into a MODELS directory. However, if you are familiar with huggingface-hub usage, feel free to organize the models yourself.
python python_scripts/hf.py
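For reference, a script of this kind can be sketched with `huggingface_hub.snapshot_download`. The repo ids below are assumptions for illustration, not an authoritative list; consult python_scripts/hf.py for the models the authors actually fetch.

```python
from pathlib import Path

# Hypothetical repo ids for illustration -- check python_scripts/hf.py
# for the actual list shipped with the repo.
REPO_IDS = [
    "ermu2001/pllava-7b",
    "ermu2001/pllava-13b",
]


def local_dir_for(repo_id, models_root="MODELS"):
    """Map a hub repo id to the local MODELS/<name> directory layout."""
    return Path(models_root) / repo_id.split("/")[-1]


def download_all(models_root="MODELS"):
    """Snapshot each repo into the MODELS directory."""
    # Imported lazily so the path helper above also works offline.
    from huggingface_hub import snapshot_download
    for repo_id in REPO_IDS:
        snapshot_download(repo_id,
                          local_dir=str(local_dir_for(repo_id, models_root)))
```

Calling `download_all()` fetches every listed snapshot into MODELS/, matching the directory layout the training and evaluation configs expect.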

Here is some detailed information about the obtained models:

