CAE: Context AutoEncoder for Self-Supervised Representation Learning
<p align="center"> <img src='furnace/CAE.png'> </p>

This is a PyTorch implementation of CAE: Context AutoEncoder for Self-Supervised Representation Learning.
Highlights
- State-of-the-art masked image modeling (MIM) performance. The results in the paper are successfully reproduced.
Installation
Clone the repo and install the required packages:

```bash
pip install -r requirements.txt

# install apex
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
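Optionally, you can verify that the apex CUDA extensions were actually built. A minimal check, assuming a CUDA-capable machine: `FusedLayerNorm` only imports and runs when the `--cuda_ext` build succeeded.

```python
# Optional sanity check: FusedLayerNorm is only available when apex was
# built with --cuda_ext, so a successful import and forward pass confirms it.
import torch
from apex.normalization import FusedLayerNorm

ln = FusedLayerNorm(768).cuda()
x = torch.randn(2, 196, 768, device="cuda")
print(ln(x).shape)  # torch.Size([2, 196, 768])
```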
Data Preparation
First, download ImageNet-1k from http://image-net.org/.
The directory structure is the standard layout of torchvision's datasets.ImageFolder. The training and validation data are expected to be in the train/ and val/ folders, respectively (a short loading sketch follows the layout below):
```
/path/to/imagenet/
  train/
    class1/
      img1.jpeg
    class2/
      img2.jpeg
  val/
    class1/
      img3.jpeg
    class2/
      img4.jpeg
```
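As a quick way to validate the layout, the split directories can be loaded directly with torchvision's ImageFolder. A minimal sketch, where /path/to/imagenet is a placeholder for your actual dataset root:

```python
# Minimal sketch: load the ImageNet splits with torchvision's ImageFolder.
# "/path/to/imagenet" is a placeholder for your actual dataset root.
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder("/path/to/imagenet/train", transform=transform)
val_set = datasets.ImageFolder("/path/to/imagenet/val", transform=transform)
print(len(train_set), len(val_set))  # expect ~1,281,167 train / 50,000 val images
```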
Second, download the pretrained DALL-E tokenizer:

```bash
TOKENIZER_PATH=/path/to/save/dall_e_tokenizer_weight
mkdir -p $TOKENIZER_PATH
wget -O $TOKENIZER_PATH/encoder.pkl https://cdn.openai.com/dall-e/encoder.pkl
wget -O $TOKENIZER_PATH/decoder.pkl https://cdn.openai.com/dall-e/decoder.pkl
```
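To sanity-check the downloaded weights, they can be loaded standalone with OpenAI's dall_e package (pip install DALL-E). This is a minimal sketch of standalone usage; the CAE training code itself consumes these weights via `--discrete_vae_weight_path` instead.

```python
# Minimal sketch (assumes `pip install DALL-E`): load the downloaded encoder
# and tokenize a dummy image into discrete visual tokens.
import torch
from dall_e import load_model, map_pixels

device = torch.device("cpu")
encoder = load_model("/path/to/save/dall_e_tokenizer_weight/encoder.pkl", device)

# The DALL-E encoder downsamples by 8x, so a 112x112 input yields a
# 14x14 token map matching the 14x14 patch grid of a 224px ViT.
x = map_pixels(torch.rand(1, 3, 112, 112))  # pixel values expected in [0, 1]
with torch.no_grad():
    logits = encoder(x)              # (1, 8192, 14, 14) vocabulary logits
    tokens = logits.argmax(dim=1)    # (1, 14, 14) discrete token ids
print(tokens.shape)
```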
Pretraining
Here is an example that pretrains CAE-base on ImageNet-1K with 32 GPUs. Please see scripts/cae_base_800e.sh for the complete script.

```bash
OMP_NUM_THREADS=1 $PYTHON -m torch.distributed.launch \
    --nproc_per_node=8 \
    tools/run_pretraining.py \
    --data_path ${DATA_PATH} \
    --output_dir ${OUTPUT_DIR} \
    --model cae_base_patch16_224_8k_vocab --discrete_vae_weight_path ${TOKENIZER_PATH} \
    --batch_size 64 --lr 1.5e-3 --warmup_epochs 20 --epochs 800 \
    --clip_grad 3.0 --layer_scale_init_value 0.1 \
    --imagenet_default_mean_and_std \
    --color_jitter 0 \
    --drop_path 0.1 \
    --sincos_pos_emb \
    --mask_generator block \
    --num_mask_patches 98 \
    --decoder_layer_scale_init_value 0.1 \
    --no_auto_resume \
    --save_ckpt_freq 100 \
    --exp_name $my_name \
    --regressor_depth 4 \
    --decoder_depth 4 \
    --align_loss_weight 2
```
Argument notes:

- `--num_mask_patches`: number of input patches to be masked.
- `--batch_size`: batch size per GPU. The effective batch size = number of GPUs * `--batch_size`, so in the above example the effective batch size is 64 * 32 = 2048.
- `--lr`: learning rate.
- `--warmup_epochs`: learning rate warmup epochs. Warm up 10/20/40 epochs for 300/800/1600 pretraining epochs, respectively.
- `--epochs`: total pretraining epochs.
- `--clip_grad`: gradient norm clipping.
- `--drop_path`: stochastic depth rate.
- `--imagenet_default_mean_and_std`: enable this for ImageNet-1k pretraining, i.e., (0.485, 0.456, 0.406) for mean and (0.229, 0.224, 0.225) for std. For other pretraining data, use (0.5, 0.5, 0.5) for both mean and std by default.
- `--layer_scale_init_value`: 0.1 for base, 1e-5 for large; set to 0 to disable LayerScale. We set `--decoder_layer_scale_init_value` to the same value.
- `--sincos_pos_emb`: adopt sin-cos positional embedding during pretraining (see the sketch after this section).
- `--regressor_depth`: number of regressor layers.
- `--decoder_depth`: number of decoder layers.
- `--align_loss_weight`: weight of the alignment loss, 2 by default.
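The `--mask_generator block` option selects BEiT-style block-wise masking, which masks random rectangular regions of the patch grid rather than independent patches. Below is a minimal illustrative sketch of the idea, not the repo's exact masking_generator implementation; the block-size and aspect-ratio bounds are assumptions.

```python
# Illustrative sketch of block-wise masking (BEiT-style); the repo's
# masking_generator is the authoritative implementation.
import math
import random
import numpy as np

def block_mask(grid=14, num_mask=98, min_block=16, max_aspect=3.0):
    """Mask roughly num_mask of grid*grid patches via random rectangles."""
    mask = np.zeros((grid, grid), dtype=bool)
    while mask.sum() < num_mask:
        # sample a target block area and a log-uniform aspect ratio
        target = random.uniform(min_block, num_mask - mask.sum() + min_block)
        aspect = math.exp(random.uniform(math.log(1 / max_aspect), math.log(max_aspect)))
        h = int(round(math.sqrt(target * aspect)))
        w = int(round(math.sqrt(target / aspect)))
        if 0 < h < grid and 0 < w < grid:
            top = random.randint(0, grid - h)
            left = random.randint(0, grid - w)
            mask[top:top + h, left:left + w] = True
    return mask

m = block_mask()
print(m.sum(), "of", m.size, "patches masked")  # roughly 98 of 196
```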
For CAE-large, please refer to scripts/cae_large_1600e.sh.
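The `--sincos_pos_emb` flag above replaces learned positional embeddings with a fixed 2D sin-cos table. A minimal sketch of the common MAE-style formulation, which we assume here for illustration (CAE's exact code may differ):

```python
# Illustrative sketch of a fixed 2D sin-cos positional embedding.
import numpy as np

def get_1d_sincos_pos_embed(embed_dim, pos):
    # embed_dim must be even; pos is a 1D array of positions
    omega = np.arange(embed_dim // 2, dtype=np.float64)
    omega = 1.0 / 10000 ** (omega / (embed_dim / 2))
    out = np.einsum("p,d->pd", pos, omega)  # (num_pos, embed_dim/2)
    return np.concatenate([np.sin(out), np.cos(out)], axis=1)

def get_2d_sincos_pos_embed(embed_dim, grid_size):
    # half of the channels encode the row index, half the column index
    coords = np.arange(grid_size, dtype=np.float64)
    row = np.repeat(coords, grid_size)  # (grid_size**2,)
    col = np.tile(coords, grid_size)
    emb_row = get_1d_sincos_pos_embed(embed_dim // 2, row)
    emb_col = get_1d_sincos_pos_embed(embed_dim // 2, col)
    return np.concatenate([emb_row, emb_col], axis=1)  # (N, embed_dim)

pos_embed = get_2d_sincos_pos_embed(768, 14)  # ViT-Base, 14x14 patch grid
print(pos_embed.shape)  # (196, 768)
```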
Results
We provide results of CAE-base and CAE-large on the following evaluation tasks:
- Linear probing
- Attentive probing
- Fine-tuning
- Semantic segmentation
- Object detection and instance segmentation
Pretrained weights and logs are available (Google Drive, Baidu Cloud [Code: 4kil]). *: results from the CAE paper.
| Model | Pretraining data | #Epoch | Linear | Attentive | Fine-tuning | ADE Seg | COCO Det | COCO InstSeg |
| ---------- | ---------------- | ------ | ------ | --------- | ----------- | ------- | -------- | ------------ |
| MAE-base* | ImageNet-1K | 1600 | 67.8 | 74.2 | 83.6 | 48.1 | 48.4 | 42.6 |
| MAE-large* | ImageNet-1K | 1600 | 76.0 | 78.8 | 86.0 | 53.6 | 54.0 | 47.1 |
| CAE-base | ImageNet-1K | 300 | 64.5 | 74.0 | 83.6 | 48.1 | 48.3 | 42.7 |
| CAE-base | ImageNet-1K | 800 | 68.9 | 75.9 | 83.8 | 49.7 | 49.9 | 43.9 |
| CAE-base | ImageNet-1K | 1600 | 70.3 | 77.2 | 83.9 | 50.3 | 50.3 | 44.2 |
| CAE-large | ImageNet-1K | 1600 | 77.8 | 81.2 | 86.2 | 54.9 | 54.5 | 47.5 |
Linear Probing
- Please refer to scripts/cae_base_800e.sh (32 GPUs).
- For CAE-large, just replace `--model cae_base_patch16_224` with `--model cae_large_patch16_224`.
Attentive Probing
- Please refer to scripts/cae_base_800e.sh (32 GPUs).
- For CAE-large, just replace `--model cae_base_patch16_224` with `--model cae_large_patch16_224`.
Fine-tuning
- Please refer to scripts/cae_base_finetune.sh (32 GPUs).
- For CAE-large, please refer to scripts/cae_large_finetune.sh (32 GPUs).
Segmentation & Detection
- Please refer to the downstream_tasks directory to get started.
Acknowledgement
This repository is built on BEiT and MMSelfSup; thanks for their open-source code! Thanks also to the CAE authors for their excellent work!
Citation
```bibtex
@article{ContextAutoencoder2022,
  title={Context Autoencoder for Self-Supervised Representation Learning},
  author={Chen, Xiaokang and Ding, Mingyu and Wang, Xiaodi and Xin, Ying and Mo, Shentong and Wang, Yunhao and Han, Shumin and Luo, Ping and Zeng, Gang and Wang, Jingdong},
  journal={arXiv preprint arXiv:2202.03026},
  year={2022}
}
```