VideoMAE
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training<br> Zhan Tong, Yibing Song, Jue Wang, Limin Wang<br>Nanjing University, Tencent AI Lab
📰 News
[2023.4.18] 🎈Everyone can download Kinetics-400, which is used in VideoMAE, from this link.<br>
[2023.4.18] Code and pre-trained models of VideoMAE V2 have been released! Check and enjoy this repo!<br>
[2023.4.17] We propose EVAD, an end-to-end Video Action Detection framework.<br>
[2023.2.28] Our VideoMAE V2 is accepted by CVPR 2023! 🎉<br>
[2023.1.16] Code and pre-trained models for Action Detection in VideoMAE are available! <br>
[2022.12.27] 🎈Everyone can download extracted VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from InternVideo.<br>
[2022.11.20] 👀 VideoMAE is integrated into online demos, supported by @Sayak Paul.<br>
[2022.10.25] 👀 VideoMAE is integrated into MMAction2, the results on Kinetics-400 can be reproduced successfully. <br>
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available! <br>
[2022.10.19] The pre-trained models and scripts on UCF101 are available! <br>
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! 🎉 <br>
[2022.8.8] 👀 VideoMAE is integrated into official 🤗HuggingFace Transformers now! <br>
[2022.7.7] We have updated new results on downstream AVA 2.2 benchmark. Please refer to our paper for details. <br>
[2022.4.24] Code and pre-trained models are available now! <br>
[2022.3.24] ~~Code and pre-trained models will be released here.~~ Welcome to watch this repository for the latest updates.
✨ Highlights
🔥 Masked Video Modeling for Video Pre-Training
VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%–95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
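The tube masking idea can be sketched as follows: one random spatial mask is sampled and repeated over every temporal slice, so the same patches are hidden in all frames. The function name and token shapes below are illustrative, not the repo's exact implementation.

```python
import numpy as np

def tube_mask(num_time_slices, num_spatial_tokens, mask_ratio=0.9, rng=None):
    """Tube masking sketch: sample ONE random spatial mask and repeat it
    across every temporal slice, so masked patches cannot be recovered by
    copying from neighboring frames. (Illustrative, not the repo's code.)"""
    rng = rng if rng is not None else np.random.default_rng(0)
    num_masked = int(mask_ratio * num_spatial_tokens)
    spatial = np.zeros(num_spatial_tokens, dtype=bool)
    masked_idx = rng.choice(num_spatial_tokens, size=num_masked, replace=False)
    spatial[masked_idx] = True
    # Shape (T, H*W): the identical mask in every temporal slice ("tubes").
    return np.tile(spatial, (num_time_slices, 1))

# e.g. 8 temporal slices of 14x14 = 196 spatial tokens, 90% masked per slice
mask = tube_mask(8, 196, mask_ratio=0.9)
```

With plain random masking, a patch hidden in one frame often stays visible in the next, making reconstruction trivial; extending the mask through time is what keeps the task challenging.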
⚡️ A Simple, Efficient and Strong Baseline in SSVP
VideoMAE uses a simple masked autoencoder with a plain ViT backbone to perform video self-supervised learning. Thanks to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2× speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
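A back-of-envelope calculation shows where the efficiency comes from: an MAE-style encoder only processes the visible tokens, and self-attention cost grows quadratically with sequence length. The token count below is illustrative, not taken from the paper.

```python
# With a 90% masking ratio, only the visible ~10% of space-time tokens
# reach the encoder. Since self-attention cost scales quadratically with
# sequence length, the encoder's attention cost shrinks to roughly 1% of
# the full-sequence cost. (Illustrative token count, not from the paper.)
tokens_full = 8 * 14 * 14                      # example space-time tokens
visible = tokens_full - int(0.9 * tokens_full) # tokens the encoder sees
attn_cost_ratio = (visible / tokens_full) ** 2 # quadratic attention term
print(visible, f"{attn_cost_ratio:.2%}")
```

The decoder still reconstructs the full video, but it is shallow and cheap, so most of the saving from the short encoder sequence is kept.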
😮 High performance, but NO extra data required
VideoMAE works well on video datasets of different scales, achieving 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without needing any extra data or pre-trained models.
🚀 Main Results
✨ Something-Something V2
| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
| VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
| VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |
✨ Kinetics-400
| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
| VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
| VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |
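The "#Frames x Clips x Crops" column describes the multi-view inference protocol: the network scores every (temporal clip, spatial crop) view of a video and the class probabilities are averaged. A minimal sketch of that averaging, with an illustrative `score_fn` name rather than an API from this repo:

```python
import numpy as np

def multi_view_score(score_fn, views):
    """Multi-view inference sketch: run the network on every
    (temporal clip, spatial crop) view and average the class
    probabilities. `score_fn` maps one view to per-class logits;
    the name is illustrative, not an API from this repo."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    return np.mean([softmax(score_fn(v)) for v in views], axis=0)

# e.g. Kinetics-400 at 16x5x3: 5 temporal clips x 3 spatial crops = 15 views
```

Averaging probabilities over more views generally trades inference cost for accuracy, which is why the 32x4x3 rows outperform the 16x5x3 rows at the same backbone.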
✨ AVA 2.2
Please check the code and checkpoints in VideoMAE-Action-Detection.

| Method | Extra Data | Extra Label | Backbone | #Frame x Sample Rate | mAP |
| :------: | :----------: | :---------: | :------: | :------------------: | :--: |
| VideoMAE | Kinetics-400 | ✗ | ViT-S | 16x4 | 22.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-S | 16x4 | 28.4 |
| VideoMAE | Kinetics-400 | ✗ | ViT-B | 16x4 | 26.7 |
| VideoMAE | Kinetics-400 | ✓ | ViT-B | 16x4 | 31.8 |
| VideoMAE | Kinetics-400 | ✗ | ViT-L | 16x4 | 34.3 |
| VideoMAE | Kinetics-400 | ✓ | ViT-L | 16x4 | 37.0 |
| VideoMAE | Kinetics-400 | ✗ | ViT-H | 16x4 | 36.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-H | 16x4 | 39.5 |
| VideoMAE | Kinetics-700 | ✗ | ViT-L | 16x4 | 36.1 |
| VideoMAE | Kinetics-700 | ✓ | ViT-L | 16x4 | 39.3 |
✨ UCF101 & HMDB51
| Method | Extra Data | Backbone | UCF101 | HMDB51 |
| :------: | :----------: | :------: | :----: | :----: |
| VideoMAE | no | ViT-B | 91.3 | 62.6 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |
🔨 Installation
Please follow the instructions in INSTALL.md.
➡️ Data Preparation
Please follow the instructions in DATASET.md for data preparation.
🔄 Pre-training
The pre-training instruction is in PRETRAIN.md.
⤴️ Fine-tuning with pre-trained models
The fine-tuning instruction is in FINETUNE.md.
📍Model Zoo
We provide pre-trained and fine-tuned models in MODEL_ZOO.md.
👀 Visualization
We provide the script for visualization in vis.sh. Colab notebook for better visualization is coming soon.
☎️ Contact
Zhan Tong: tongzhan@smail.nju.edu.cn
👍 Acknowledgements
Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their help.
