ConvMAE
ConvMAE: Masked Convolution Meets Masked Autoencoders
Peng Gao<sup>1</sup>, Teli Ma<sup>1</sup>, Hongsheng Li<sup>2</sup>, Ziyi Lin<sup>2</sup>, Jifeng Dai<sup>3</sup>, Yu Qiao<sup>1</sup>
<sup>1</sup> Shanghai AI Laboratory, <sup>2</sup> MMLab, CUHK, <sup>3</sup> Sensetime Research.
* We have changed the project name from ConvMAE to MCMAE.
This repo is the official implementation of MCMAE: Masked Convolution Meets Masked Autoencoders. It currently includes code and models for the following tasks:
ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.
Video Classification: See VideoConvMAE.
Updates
14/Mar/2023
MR-MCMAE (a.k.a. ConvMAE-v2) paper released: Mimic before Reconstruct: Enhancing Masked Autoencoders with Feature Mimicking.
15/Sep/2022
Paper accepted at NeurIPS 2022.
9/Sep/2022
ConvMAE-v2 pretrained checkpoints are released.
21/Aug/2022
Official-ConvMAE-Det, which follows the official ViTDet codebase, is released.
08/Jun/2022
🚀FastConvMAE🚀: significantly reduces pretraining time (4000 single-GPU hours => 200 single-GPU hours). The code is going to be released at FastConvMAE.
27/May/2022
- Code for ImageNet-1K pretraining is released.
- Code and models for semantic segmentation are provided.
20/May/2022
Update results on video classification.
16/May/2022
Code and models for COCO object detection and instance segmentation are available.
11/May/2022
- Pretrained ConvMAE models on ImageNet-1K are released.
- Code and models for ImageNet-1K finetuning and linear probing are provided.
08/May/2022
The preprint is available on arXiv.
Introduction
The ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme.
- We present ConvMAE, a strong and efficient self-supervised framework that is easy to implement yet shows outstanding performance on downstream tasks.
- ConvMAE naturally generates hierarchical representations and exhibits promising performance on object detection and segmentation.
- ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask R-CNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule, while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (51.7 vs. 48.1).
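
To make the masked auto-encoding scheme more concrete, here is a minimal sketch of random token masking with a 25% visible (encoder) ratio, matching the "Encoder Ratio" reported for ConvMAE-B in the ImageNet-1K results. It is not taken from this repo's code. The 14x14 token grid, the upsampling of the mask to 28x28 and 56x56 grids for the earlier convolutional stages, and the function names `random_token_mask` and `upsample_mask` are all assumptions made for illustration.

```python
import random


def random_token_mask(grid=14, keep_ratio=0.25, seed=0):
    """Return a grid x grid list of booleans; True marks a masked (hidden) token.

    The grid size and seed are illustrative assumptions, not repo defaults.
    """
    rng = random.Random(seed)
    num_tokens = grid * grid
    num_keep = int(num_tokens * keep_ratio)
    # Pick the token indices that stay visible to the encoder.
    visible = set(rng.sample(range(num_tokens), num_keep))
    return [[(r * grid + c) not in visible for c in range(grid)]
            for r in range(grid)]


def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling: each coarse token covers factor x factor cells."""
    height, width = len(mask), len(mask[0])
    return [[mask[r // factor][c // factor] for c in range(width * factor)]
            for r in range(height * factor)]


# Assumed stage resolutions: coarse token grid plus two finer grids.
mask_coarse = random_token_mask()            # 14 x 14
mask_mid = upsample_mask(mask_coarse, 2)     # 28 x 28
mask_fine = upsample_mask(mask_coarse, 4)    # 56 x 56

num_visible = sum(not m for row in mask_coarse for m in row)
print(f"visible tokens: {num_visible} of {14 * 14}")
```

Because the finer masks are derived from the coarse one, a region hidden at the token level is hidden at every resolution, so no stage sees pixels that belong to a masked token.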

Pretrain on ImageNet-1K
The following table provides pretrained checkpoints and logs used in the paper.

| | ConvMAE-Base |
| :---: | :---: |
| pretrained checkpoints | download |
| logs | download |

The following results are for ConvMAE-v2 (pretrained for 200 epochs on ImageNet-1k).

| model | pretrained checkpoints | ft. acc. on ImageNet-1k |
| :---: | :---: | :---: |
| ConvMAE-v2-Small | download | 83.6 |
| ConvMAE-v2-Base | download | 85.7 |
| ConvMAE-v2-Large | download | 86.8 |
| ConvMAE-v2-Huge | download | 88.0 |
Main Results on ImageNet-1K
| Models | #Params(M) | Supervision | Encoder Ratio | Pretrain Epochs | FT acc@1(%) | LIN acc@1(%) | FT logs/weights | LIN logs/weights |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEiT | 88 | DALLE | 100% | 300 | 83.0 | 37.6 | - | - |
| MAE | 88 | RGB | 25% | 1600 | 83.6 | 67.8 | - | - |
| SimMIM | 88 | RGB | 100% | 800 | 84.0 | 56.7 | - | - |
| MaskFeat | 88 | HOG | 100% | 300 | 83.6 | N/A | - | - |
| data2vec | 88 | RGB | 100% | 800 | 84.2 | N/A | - | - |
| ConvMAE-B | 88 | RGB | 25% | 1600 | 85.0 | 70.9 | log/weight |
Main Results on COCO
Mask R-CNN
| Models | Pretrain | Pretrain Epochs | Finetune Epochs | #Params(M) | FLOPs(T) | box AP | mask AP | logs/weights |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Swin-B | IN21K w/ labels | 90 | 36 | 109 | 0.7 | 51.4 | 45.4 | - |
| Swin-L | IN21K w/ labels | 90 | 36 | 218 | 1.1 | 52.4 | 46.2 | - |
| MViTv2-B | IN21K w/ labels | 90 | 36 | 73 | 0.6 | 53.1 | 47.4 | - |
| MViTv2-L | IN21K w/ labels | 90 | 36 | 239 | 1.3 | 53.6 | 47.5 | - |
| Benchmarking-ViT-B | IN1K w/o labels | 1600 | 100 | 118 | 0.9 | 50.4 | 44.9 | - |
| Benchmarking-ViT-L | IN1K w/o labels | 1600 | 100 | 340 | 1.9 | 53.3 | 47.2 | - |
| ViTDet | IN1K w/o labels | 1600 | 100 | 111 | 0.8 | 51.2 | 45.5 | - |
| MIMDet-ViT-B | IN1K w/o labels | 1600 | 36 | 127 | 1.1 | 51.5 | 46.0 | - |
| MIMDet-ViT-L | IN1K w/o labels | 1600 | 36 | 345 | 2.6 | 53.3 | 47.5 | - |
| ConvMAE-B | IN1K w/o labels | 1600 | 25 | 104 | 0.9 | 53.2 | 47.1 | log/weight |
Main Results on ADE20K
UperNet
| Models | Pretrain | Pretrain Epochs | Finetune Iters | #Params(M) | FLOPs(T) | mIoU | logs/weights |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| DeiT-B | IN1K w/ labels | 300 | 16K | 163 | 0.6 | 45.6 | - |
| Swin-B | IN1K w/ labels | 300 | 16K | 121 | 0.3 | 48.1 | - |
| MoCo V3 | IN1K | 300 | 16K | 163 | 0.6 | 47.3 | - |
| DINO | IN1K | 400 | 16K | 163 | 0.6 | 47.2 | - |
| BEiT | IN1K+DALLE | 1600 | 16K | 163 | 0.6 | 47.1 | - |
| PeCo | IN1K | 300 | 16K | 163 | 0.6 | 46.7 | - |
| CAE | IN1K+DALLE | 800 | 16K | 163 | 0.6 | 48.8 | - |
| MAE | IN1K | 1600 | 16K | 163 | 0.6 | 48.1 | - |
| ConvMAE-B | IN1K | 1600 | 16K | 153 | 0.6 | 51.7 | log/weight |
Main Results on Kinetics-400
| Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE-B | 200 | 100 | 87 | 77.8 | | |
| VideoMAE-B | 800 | 100 | 87 | 79.4 | | |
| VideoMAE-B | 1600 | 100 | 87 | 79.8 | | |
| VideoMAE-B | 1600 | 100 (w/ Repeated Aug) | 87 | 80.7 | 94.7 | |
| SpatioTemporalLearner-B | 800 | 150 (w/ Repeated Aug) | 87 | 81.3 | 94.9 | |
| VideoConvMAE-B | 200 | 100 | 86 | 80.1 | 94.3 | Soon |
| VideoConvMAE-B | 800 | 100 | 86 | 81.7 | 95.1 | Soon |
| VideoConvMAE-B-MSD | 800 | 100 | 86 | 82.7 | 95.5 | Soon |
Main Results on Something-Something V2
| Models | Pretrain Epochs | Finetune Epochs | #Params(M) | Top1 | Top5 | logs/weights |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| VideoMAE-B | 200 | 40 | 87 | 66.1 | | |
| VideoMAE-B | 800 | 40 | 87 | 69.3 | | |
| VideoMAE-B | 2400 | 40 | 87 | 70.3 | | |
| VideoConvMAE-B | 200 | 40 | 86 | 67.7 | 91.2 | Soon |
| VideoConvMAE-B | 800 | 40 | 86 | 69.9 | 92.4 | Soon |
| VideoConvMAE-B-MSD | 800 | 40 | 86 | 70.7 | 93.0 | Soon |
Getting Started
Prerequisites
- Linux
- Python 3.7+
- CUDA 10.2+
- GCC 5+
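
As a quick sanity check before following the training guides, the prerequisites above can be probed with a short script. This is only a sketch using the Python standard library: the helper name `check_prerequisites` is hypothetical, and the script only checks that `gcc` and `nvcc` are on the `PATH`. It does not verify the GCC 5+ or CUDA 10.2+ version requirements, which still need to be confirmed by hand.

```python
import platform
import shutil
import sys


def check_prerequisites(min_python=(3, 7)):
    """Return a dict of booleans describing the local environment.

    Only the presence of gcc and nvcc is checked, not their versions.
    """
    return {
        "linux": platform.system() == "Linux",
        "python_ok": sys.version_info >= min_python,
        "gcc_found": shutil.which("gcc") is not None,
        "nvcc_found": shutil.which("nvcc") is not None,
    }


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'ok' if ok else 'missing'}")
```

A `missing` entry for `nvcc_found` may simply mean the CUDA toolkit is installed but not on the `PATH`, so treat the output as a hint rather than a verdict.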
Training and evaluation
- See PRETRAIN.md for pretraining.
- See FINETUNE.md for pretrained model finetuning and linear probing.
- See DETECTION.md for using the pretrained backbone with Mask R-CNN.
- See SEGMENTATION.md for using pretrained backbone on [UperNet](https://openaccess.
