# MovieChat

**[CVPR 2024] MovieChat: From Dense Token to Sparse Memory for Long Video Understanding**

<img width="1155" alt="image" src="https://github.com/user-attachments/assets/4c0412d3-0729-4f56-af0c-1ee3eeac8f99">
Enxin Song*, Wenhao Chai*, Guanhong Wang*, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Xun Guo, Tian Ye, Yan Lu, Jenq-Neng Hwang, Gaoang Wang✉️
CVPR 2024.
MovieChat can handle videos with more than 10K frames on a 24 GB graphics card. In terms of the average increase in GPU memory cost per frame, MovieChat has a roughly 10,000× advantage over other methods (21.3 KB/frame vs. ~200 MB/frame).
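The constant memory footprint comes from consolidating dense per-frame tokens into a fixed-size sparse memory. A minimal sketch of that idea (illustrative only, not the actual MovieChat implementation): keep a fixed budget of frame features and, whenever the budget is exceeded, merge the two most similar *adjacent* frames, so memory size stays constant no matter how long the video is.

```python
# Hedged sketch of MovieChat-style memory consolidation; function and
# variable names are illustrative, not the repository's API.
import math

def cosine(a, b):
    # cosine similarity between two feature vectors (plain Python lists)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def consolidate(memory, capacity):
    """Greedily merge the most similar adjacent frame features
    until len(memory) <= capacity."""
    memory = [list(f) for f in memory]
    while len(memory) > capacity:
        # pick the adjacent pair with the highest cosine similarity
        i = max(range(len(memory) - 1),
                key=lambda j: cosine(memory[j], memory[j + 1]))
        merged = [(x + y) / 2 for x, y in zip(memory[i], memory[i + 1])]
        memory[i:i + 2] = [merged]
    return memory

frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(len(consolidate(frames, 2)))  # -> 2
```

Because each merge removes one entry, memory grows by at most a constant per ingested frame, which is what keeps the per-frame GPU cost in the KB range.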
<p align="center" width="100%"> <a target="_blank"><img src="src/assets/wave.gif" alt="MovieChat" style="width: 80%; min-width: 200px; display: block; margin: auto;"></a> </p>

<h5 align="center"> If you like our project, please give us a star ⭐ on GitHub for the latest update.</h5>

## 🔢 MovieChat-1K Leaderboard
Feel free to PR your new results!
| Model with Link | Comment | Breakpoint Acc | Global Acc |
|-----------------------------------------------|------------------------------|------------|----------------|
| Video-LLaMA | End-to-end | 39.1 | 51.7 |
| VideoChat | End-to-end | 46.1 | 57.8 |
| TimeChat | CoT, ICL, train on MovieChat | 46.1 | 73.8 |
| VideoChatGPT | End-to-end | 48.0 | 47.6 |
| MovieChat (baseline) | End-to-end | 48.3 | 62.3 |
| MovieChat+ (baseline) | End-to-end | 49.6 | 71.2 |
| Long-LLaVA | End-to-end | 54.0 | 69.6 |
| Long-LLaVA + Video-RAG | End-to-end | 54.5 | 72.9 |
| Streaming Long Video | Train on MovieChat | 54.9 | 90.4 |
| DrVideo | RAG | 56.7 | 93.1 |
| ReWind | End-to-end | 57.2 | 87.6 |
| HERMES | Train on MovieChat | 57.3 | 78.6 |
| Flash-VStream | Train on MovieChat | 59.6 | 96.0 |
| MM-Screenplayer | RAG | 68.8 | 87.5 |
| VILA1.5-8B | End-to-end | - | 40.0 |
| FocusChat | End-to-end | - | 60.0 |
| llavaonevision-MovieChat | End-to-end | - | 79.0 |
| Sullam Jeoung, et al. | Agent | - | 84.8 |
| SEAL | Train on MovieChat | - | 86.8 |
| HEM-LLM | Unknown training dataset | - | 90.6 |
## 🔢 Evaluation of MovieChat on Existing Benchmarks
Sorted in alphabetical order.
| Benchmark | Results |
|-----------|---------|
| ActivityNet-QA | Acc. / Score: 45.7 / 3.4 |
| Charades-STA | R@1 (IoU=0.3): 8.8 • R@1 (IoU=0.5): 2.9 • R@1 (IoU=0.7): 1.3 |
| CineClipQA | Overall: 20.86/2.11 • Description: 23.67/2.41 • Intention: 30.19/2.41 • Perception: 21.80/1.97 • Temporality: 16.32/1.97 • Spatiality: 16.40/1.98 |
| CVRR-ES | Average: 16.41 |
| EgoSchema | Top-1 Acc: 53.5 |
| EventBench | Acc: 20.33 |
| InfiniBench | Global Appearance: 6.59 • Scene transition: 6.41 • Character actions: 4.51 • Temporal order: 36.99 • Local visual: 17.76 • Summarization: 0.14 • Deep context: 0.55 • Spoiler questions: 0.34 • Multiple events: 0.85 • Avg: 14.45/0.47 |
| InfiniBench-Vision | Acc: 14.2 • Score: 1.2 |
| LvBench | ER: 21.3 • EU: 23.1 • KIR: 25.9 • TG: 22.3 • Rea: 24.0 • Sum: 17.2 • Overall: 22.5 |
| LvM-QA | Acc. / Score: 48.3 / 2.57 |
| MLVU | Holistic TR: 29.5 • AR: 25.0 • VS: 2.33 • Single Detail NQA: 24.2 • ER: 24.7 • PQA: 25.8 • SSC: 3.23 • Multi Detail AO: 28.6 • AC: 22.8 • M-Avg: 25.8 • G-Avg: 2.78 |
| MovieChat-1K | Global Acc. / Score: 62.3 / 3.23 • Breakpoint Acc. / Score: 48.3 / 2.57 |
| MovieCORE | Acc: 20.33 • Comp: 2.90 • Depth: 2.29 • Evid: 2.14 • Coh: 2.30 • Avg: 2.23 |
| MSVD-QA | Acc. / Score: 75.2 / 3.8 |
| MSRVTT-QA | Acc. / Score: 52.7 / 2.6 |
| NExT-QA | Acc. / Score: 49.9 / 2.7 |
| QVHighlight | mAP: 11.7 • HIT@1: 16.1 |
| RVS-Ego | Acc. / Score: 50.7 / 3.4 |
| RVS-Movie | Acc. / Score: 36.0 / 2.3 |
| Seed-Bench | Procedure Understanding: 29.82 • Action Recognition: 40.11 |
| SFD | Multiple-Choice V: 8.4 • L: 16.4 • VL: 8.0 • Open-Ended V: 14.0 • L: 15.7 • VL: 11.8 |
| SVBench | Dialogue SA: 20.46 • Dialogue CC: 20.05 • Dialogue LC: 27.76 • Dialogue TU: 21.81 • Dialogue IC: 22.21 • Dialogue OS: 21.89 • Streaming SA: 17.99 • Streaming CC: 16.42 • Streaming LC: 20.37 • Streaming TU: 15.77 • Streaming IC: 19.08 • Streaming OS: 17.43 |
| TV-Caption | BertScore: 38.11 • CIDEr: 8.43 • ROUGE-L: 12.09 • SPICE: 9.21 |
| VCG Bench | CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 • Avg: 2.67 |
| VDC | Camera: 37.25/1.98 • Short: 32.55/1.59 • Background: 28.99/1.54 • Main: 31.97/1.64 • Object: 28.82/1.46 • Avg: 31.92/1.64 |
| VideoMME | w/o subs: 38.2 • w/o subs (Long): 33.4 |
| Video-ChatGPT | Avg: 2.67 • CI: 2.76 • DO: 2.93 • CU: 3.01 • TU: 2.24 • CO: 2.42 |
| VS-Ego | Acc. / Score: 52.2 / 3.4 |
| VS-Movie | Acc. / Score: 39.1 / 2.3 |
| YouCook2 | C: 38.5 • M: 18.8 |
## :fire: News
- [2024.10.26] :keyboard: We upload MovieChat, MovieChat_OneVision, and MovieChat-1K to lmms-eval.
- [2024.10.26] :keyboard: We release a new version of MovieChat that uses LLaVA-OneVision as the base model instead of the original VideoLLaMA. The new version is available in MovieChat_Onevision.
- [2024.6.13] :film_projector: We release the ground truth of MovieChat's test set on Hugging Face.
- [2024.5.10] :film_projector: We release the raw videos of MovieChat's training set on Hugging Face.
- [2024.4.29] :page_with_curl: We update the MovieChat+ paper with implementation details, technical evaluations, and dataset information.
- [2024.4.25] :keyboard: We release a new version, MovieChat+, along with its code and the corresponding evaluation code. The paper is coming soon!
- [2024.4.19] :keyboard: We publish the latest source code of MovieChat to PyPI. Now you can install MovieChat directly via `pip install MovieChat`!
- [2024.3.25] :bar_chart: We host challenge track 1 of the 4th International Workshop on Long-form Video Understanding: Towards Multimodal AI Assistant and Copilot at CVPR 2024. You can participate in the challenge and submit your results via Codalab; we will display the results on the leaderboard. Please submit your results in JSON format and report both the average running time and VRAM usage, which we will use to select the most efficient method. For detailed information about the challenge, please refer to this link.
- [2024.3.11] :film_projector: We release the test set of MovieChat-1K on Hugging Face. Each video comes with 3 global questions and 10 breakpoint questions.
- [2024.2.27] :tada: Our paper was accepted by CVPR 2024!
- [2024.2.14] :film_projector: We release the training set of MovieChat-1K on Hugging Face. Due to copyright restrictions, we share the CLIP features extracted by eva_vit_g, covering 8,192 frames of each video.
- [2023.11.27] :page_with_curl: We update the paper with implementation details, technical evaluations, and dataset information.
- [2023.11.23] :keyboard: We update the latest source code of MovieChat.
- [2023.8.1] :page_with_curl: We release the paper.
- [2023.7.31] :keyboard: We release the evaluation code and instructions for short video QA on MSVD-QA, MSRVTT-QA, and ActivityNet-QA.
- [2023.7.29] :joystick: We release the Gradio demo of MovieChat.
- [2023.7.22] :keyboard: We release the source code of MovieChat.
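For the challenge above, results are submitted in JSON and scored separately on global and breakpoint questions. A hedged sketch of how such a file could be scored; the field names (`type`, `correct`) are illustrative, not the official challenge schema:

```python
# Hypothetical result-file scorer for MovieChat-1K-style QA; the JSON
# schema below is an assumption for illustration, not the official format.
import json

def accuracy(results, qtype):
    """Percentage of correct answers among questions of the given type."""
    hits = [r["correct"] for r in results if r["type"] == qtype]
    return 100.0 * sum(hits) / len(hits) if hits else 0.0

raw = json.dumps([
    {"video": "demo.mp4", "type": "global",     "correct": True},
    {"video": "demo.mp4", "type": "global",     "correct": False},
    {"video": "demo.mp4", "type": "breakpoint", "correct": True},
])
results = json.loads(raw)
print(accuracy(results, "global"))      # -> 50.0
print(accuracy(results, "breakpoint"))  # -> 100.0
```

Breakpoint accuracy is computed per timestamped question, global accuracy over whole-video questions, matching the two leaderboard columns above.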
