VideoMAE
[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Official PyTorch Implementation of VideoMAE (NeurIPS 2022 Spotlight).

VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training<br> Zhan Tong, Yibing Song, Jue Wang, Limin Wang<br>Nanjing University, Tencent AI Lab
📰 News
[2023.4.18] 🎈Everyone can download Kinetics-400, which is used in VideoMAE, from this link.<br>
[2023.4.18] Code and pre-trained models of VideoMAE V2 have been released! Check and enjoy this repo!<br>
[2023.4.17] We propose EVAD, an end-to-end Video Action Detection framework.<br>
[2023.2.28] Our VideoMAE V2 is accepted by CVPR 2023! 🎉<br>
[2023.1.16] Code and pre-trained models for Action Detection in VideoMAE are available! <br>
[2022.12.27] 🎈Everyone can download extracted VideoMAE features of THUMOS, ActivityNet, HACS and FineAction from InternVideo.<br>
[2022.11.20] 👀 VideoMAE is integrated into online demos, supported by @Sayak Paul.<br>
[2022.10.25] 👀 VideoMAE is integrated into MMAction2, the results on Kinetics-400 can be reproduced successfully. <br>
[2022.10.20] The pre-trained models and scripts of ViT-S and ViT-H are available! <br>
[2022.10.19] The pre-trained models and scripts on UCF101 are available! <br>
[2022.9.15] VideoMAE is accepted by NeurIPS 2022 as a spotlight presentation! 🎉 <br>
[2022.8.8] 👀 VideoMAE is integrated into official 🤗HuggingFace Transformers now! <br>
[2022.7.7] We have updated new results on downstream AVA 2.2 benchmark. Please refer to our paper for details. <br>
[2022.4.24] Code and pre-trained models are available now! <br>
[2022.3.24] ~~Code and pre-trained models will be released here.~~ Welcome to watch this repository for the latest updates.
✨ Highlights
🔥 Masked Video Modeling for Video Pre-Training
VideoMAE performs masked video modeling for video pre-training. We propose an extremely high masking ratio (90%–95%) and a tube masking strategy to create a challenging task for self-supervised video pre-training.
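The tube masking idea can be sketched as follows: one random spatial mask is sampled and repeated over every temporal slice, so the same patches are hidden in all frames. The function name and token shapes below are illustrative, not the repo's exact implementation.

```python
import numpy as np

def tube_mask(num_time_slices, num_spatial_tokens, mask_ratio=0.9, rng=None):
    """Tube masking sketch: sample ONE random spatial mask and repeat it
    across every temporal slice, so masked patches cannot be recovered by
    copying from neighboring frames. (Illustrative, not the repo's code.)"""
    rng = rng if rng is not None else np.random.default_rng(0)
    num_masked = int(mask_ratio * num_spatial_tokens)
    spatial = np.zeros(num_spatial_tokens, dtype=bool)
    masked_idx = rng.choice(num_spatial_tokens, size=num_masked, replace=False)
    spatial[masked_idx] = True
    # Shape (T, H*W): the identical mask in every temporal slice ("tubes").
    return np.tile(spatial, (num_time_slices, 1))

# e.g. 8 temporal slices of 14x14 = 196 spatial tokens, 90% masked per slice
mask = tube_mask(8, 196, mask_ratio=0.9)
```

With plain random masking, a patch hidden in one frame often stays visible in the next, making reconstruction trivial; extending the mask through time is what keeps the task challenging.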
⚡️ A Simple, Efficient and Strong Baseline in SSVP
VideoMAE uses a simple masked autoencoder with a plain ViT backbone to perform video self-supervised learning. Thanks to the extremely high masking ratio, the pre-training time of VideoMAE is much shorter than that of contrastive learning methods (3.2× speedup). VideoMAE can serve as a simple but strong baseline for future research in self-supervised video pre-training.
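A back-of-envelope calculation shows where the efficiency comes from: an MAE-style encoder only processes the visible tokens, and self-attention cost grows quadratically with sequence length. The token count below is illustrative, not taken from the paper.

```python
# With a 90% masking ratio, only the visible ~10% of space-time tokens
# reach the encoder. Since self-attention cost scales quadratically with
# sequence length, the encoder's attention cost shrinks to roughly 1% of
# the full-sequence cost. (Illustrative token count, not from the paper.)
tokens_full = 8 * 14 * 14                      # example space-time tokens
visible = tokens_full - int(0.9 * tokens_full) # tokens the encoder sees
attn_cost_ratio = (visible / tokens_full) ** 2 # quadratic attention term
print(visible, f"{attn_cost_ratio:.2%}")
```

The decoder still reconstructs the full video, but it is shallow and cheap, so most of the saving from the short encoder sequence is kept.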
😮 High performance, but NO extra data required
VideoMAE works well on video datasets of different scales, achieving 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51. To the best of our knowledge, VideoMAE is the first to achieve state-of-the-art performance on these four popular benchmarks with vanilla ViT backbones, without needing any extra data or pre-trained models.
🚀 Main Results
✨ Something-Something V2
| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x2x3 | 66.8 | 90.3 |
| VideoMAE | no | ViT-B | 224x224 | 16x2x3 | 70.8 | 92.4 |
| VideoMAE | no | ViT-L | 224x224 | 16x2x3 | 74.3 | 94.6 |
| VideoMAE | no | ViT-L | 224x224 | 32x1x3 | 75.4 | 95.2 |
✨ Kinetics-400
| Method | Extra Data | Backbone | Resolution | #Frames x Clips x Crops | Top-1 | Top-5 |
| :------: | :--------: | :------: | :--------: | :---------------------: | :---: | :---: |
| VideoMAE | no | ViT-S | 224x224 | 16x5x3 | 79.0 | 93.8 |
| VideoMAE | no | ViT-B | 224x224 | 16x5x3 | 81.5 | 95.1 |
| VideoMAE | no | ViT-L | 224x224 | 16x5x3 | 85.2 | 96.8 |
| VideoMAE | no | ViT-H | 224x224 | 16x5x3 | 86.6 | 97.1 |
| VideoMAE | no | ViT-L | 320x320 | 32x4x3 | 86.1 | 97.3 |
| VideoMAE | no | ViT-H | 320x320 | 32x4x3 | 87.4 | 97.6 |
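The "#Frames x Clips x Crops" column describes the multi-view inference protocol: the network scores every (temporal clip, spatial crop) view of a video and the class probabilities are averaged. A minimal sketch of that averaging, with an illustrative `score_fn` name rather than an API from this repo:

```python
import numpy as np

def multi_view_score(score_fn, views):
    """Multi-view inference sketch: run the network on every
    (temporal clip, spatial crop) view and average the class
    probabilities. `score_fn` maps one view to per-class logits;
    the name is illustrative, not an API from this repo."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    return np.mean([softmax(score_fn(v)) for v in views], axis=0)

# e.g. Kinetics-400 at 16x5x3: 5 temporal clips x 3 spatial crops = 15 views
```

Averaging probabilities over more views generally trades inference cost for accuracy, which is why the 32x4x3 rows outperform the 16x5x3 rows at the same backbone.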
✨ AVA 2.2
Please check the code and checkpoints in VideoMAE-Action-Detection.

| Method | Extra Data | Extra Label | Backbone | #Frame x Sample Rate | mAP |
| :------: | :----------: | :---------: | :------: | :------------------: | :--: |
| VideoMAE | Kinetics-400 | ✗ | ViT-S | 16x4 | 22.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-S | 16x4 | 28.4 |
| VideoMAE | Kinetics-400 | ✗ | ViT-B | 16x4 | 26.7 |
| VideoMAE | Kinetics-400 | ✓ | ViT-B | 16x4 | 31.8 |
| VideoMAE | Kinetics-400 | ✗ | ViT-L | 16x4 | 34.3 |
| VideoMAE | Kinetics-400 | ✓ | ViT-L | 16x4 | 37.0 |
| VideoMAE | Kinetics-400 | ✗ | ViT-H | 16x4 | 36.5 |
| VideoMAE | Kinetics-400 | ✓ | ViT-H | 16x4 | 39.5 |
| VideoMAE | Kinetics-700 | ✗ | ViT-L | 16x4 | 36.1 |
| VideoMAE | Kinetics-700 | ✓ | ViT-L | 16x4 | 39.3 |
✨ UCF101 & HMDB51
| Method | Extra Data | Backbone | UCF101 | HMDB51 |
| :------: | :----------: | :------: | :----: | :----: |
| VideoMAE | no | ViT-B | 91.3 | 62.6 |
| VideoMAE | Kinetics-400 | ViT-B | 96.1 | 73.3 |
🔨 Installation
Please follow the instructions in INSTALL.md.
➡️ Data Preparation
Please follow the instructions in DATASET.md for data preparation.
🔄 Pre-training
The pre-training instruction is in PRETRAIN.md.
⤴️ Fine-tuning with pre-trained models
The fine-tuning instruction is in FINETUNE.md.
📍Model Zoo
We provide pre-trained and fine-tuned models in MODEL_ZOO.md.
👀 Visualization
We provide the script for visualization in vis.sh. Colab notebook for better visualization is coming soon.
☎️ Contact
Zhan Tong: tongzhan@smail.nju.edu.cn
👍 Acknowledgements
Thanks to Ziteng Gao, Lei Chen, Chongjian Ge, and Zhiyu Zhao for their help.
