Integrally Pre-Trained Transformer Pyramid Networks
Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration
[CVPR2023/TPAMI2024]
(A Simple Hierarchical Vision Transformer Meets Masked Image Modeling)
<p align="center"> <img src="assets/framework.png" alt="iTPN" width="90%"> </p> <p align="center"> Figure 1: The comparison between conventional pre-training (left) and the proposed integral pre-training framework (right). We use a feature pyramid as the unified neck module and apply masked feature modeling to pre-train the feature pyramid. The green and red blocks indicate network weights that are pre-trained and un-trained (i.e., randomly initialized for fine-tuning), respectively. </p>

Updates
11/Jul./2024
Fast-iTPN is accepted by TPAMI2024.
08/Jan./2024
The Fast-iTPN preprint is available on arXiv. Fast-iTPN is a more powerful version of iTPN.
26/Dec./2023
| model | Para. (M) | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K | checkpoint | checkpoint (21K) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 224/16 | N | 85.1% | baidu/google | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 384/16 | N | 86.2% | | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 512/16 | N | 86.5% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 224/16 | N | 86.4% | baidu/google | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 384/16 | N | 86.95% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 512/16 | N | 87.8% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 224/16 | N | 87.4% | baidu/google | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | N | 88.5% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | Y | 88.75% | | baidu/google |
| Fast-iTPN-L | 312 | IN.1K | CLIP-L | 640/16 | N | 89.5% | baidu/google | |
All the pre-trained Fast-iTPN models are available now (password: itpn)! To the best of our knowledge, the tiny/small/base-scale models report the best performance on ImageNet-1K. Use them for your own tasks! See Details.
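When reusing a released checkpoint in your own code, the saved keys often need light remapping before calling `load_state_dict` (e.g., stripping a pre-training wrapper prefix and dropping buffers that depend on input resolution). A minimal, dependency-free sketch of that step; the key names (`encoder.`, `relative_position_index`) are illustrative assumptions, not the exact layout of the released Fast-iTPN files:

```python
# Sketch of remapping checkpoint keys before fine-tuning.
# The key names below are illustrative assumptions, not the
# exact layout of the released checkpoints.

def remap_checkpoint(state_dict):
    """Strip a pre-training prefix and drop resolution-dependent buffers."""
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith("encoder."):          # hypothetical pre-training wrapper prefix
            key = key[len("encoder."):]
        if "relative_position_index" in key:    # recomputed at the fine-tuning resolution
            continue
        remapped[key] = value
    return remapped

ckpt = {
    "encoder.patch_embed.weight": 1,
    "encoder.blocks.0.attn.qkv.weight": 2,
    "blocks.0.attn.relative_position_index": 3,
}
print(sorted(remap_checkpoint(ckpt)))
```

The filtered dict can then be passed to `model.load_state_dict(..., strict=False)` so that any remaining mismatched keys are reported rather than raising.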
30/May/2023
| model | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K |
| :---: | :---: | :---: | :---: | :---: | :---: |
| EVA-02-B | IN.21K | EVA-CLIP-g | 196/14 | N | 87.0% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | N | 88.3% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | Y | 88.6% |
| Fast-iTPN-B | IN.1K | CLIP-L | 224/16 | N | 87.4% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | N | 88.5% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | Y | 88.7% |
The Fast-iTPN models above are pre-trained only on ImageNet-1K; they will be available soon.
29/May/2023
The iTPN-L-CLIP/16 intermediate fine-tuned model is available (password: itpn): it is pre-trained on ImageNet-21K and then fine-tuned on ImageNet-1K. Evaluating it on ImageNet-1K yields 89.2% accuracy.
28/Feb./2023
iTPN is accepted by CVPR2023!
08/Feb./2023
The iTPN-L-CLIP/16 model reaches 89.2% fine-tuning performance on ImageNet-1K.
Configuration: intermediate fine-tuning on ImageNet-21K + 384 input size.
21/Jan./2023
Our HiViT is accepted by ICLR2023!
HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer
08/Dec./2022
Get checkpoints (password: abcd):

| | iTPN-B-pixel | iTPN-B-CLIP | iTPN-L-pixel | iTPN-L-CLIP/16 |
| :---: | :---: | :---: | :---: | :---: |
| baidu drive | download | download | download | download |
| google drive | download | download | download | download |
25/Nov./2022
The preprint is available on arXiv.
Requirements
- Ubuntu
- Python 3.7+
- CUDA 10.2+
- GCC 5+
- PyTorch 1.7+
Dataset
- ImageNet-1K
- COCO2017
- ADE20K
Get Started
Prepare the environment:
```shell
conda create --name itpn python=3.8 -y
conda activate itpn
git clone git@github.com:sunsmarterjie/iTPN.git
cd iTPN
pip install torch==1.7.1+cu102 torchvision==0.8.2+cu102 -f https://download.pytorch.org/whl/torch_stable.html
pip install timm==0.3.2 tensorboard einops
```
iTPN supports pre-training with either pixel or CLIP supervision. For the latter, please first download the CLIP models (we use the CLIP-B/16 and CLIP-L/14 models in the paper).
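With CLIP supervision, the pre-training objective is masked feature modeling: the student reconstructs the teacher's (CLIP) features, but only at masked token positions. A dependency-free sketch of that idea, assuming per-token scalar features and a squared-error criterion; the repo's actual loss, tensor shapes, and function names may differ:

```python
# Conceptual sketch of masked feature modeling with a teacher model:
# only masked positions contribute to the reconstruction loss.
# Names and the squared-error criterion are illustrative assumptions.

def masked_feature_loss(student_feats, teacher_feats, mask):
    """Mean squared error over masked token positions only."""
    total, count = 0.0, 0
    for s, t, m in zip(student_feats, teacher_feats, mask):
        if m:  # this token was masked during pre-training
            total += (s - t) ** 2
            count += 1
    return total / max(count, 1)

student = [0.5, 1.0, 2.0, 0.0]
teacher = [1.0, 1.0, 1.0, 1.0]
mask    = [True, False, True, False]
print(masked_feature_loss(student, teacher, mask))  # averages over the 2 masked tokens
```

Unmasked positions carry no loss, which is what lets the encoder be trained from heavily masked inputs while the teacher sees the full image.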
Main Results
<p align="center"> <img src="assets/ft_in1k.jpg" alt="iTPN" width="40%"> </p> <p align="center"> Table 1: Top-1 classification accuracy (%) by fine-tuning the pre-trained models on ImageNet-1K. We compare models of different levels and supervisions (e.g., with and without CLIP) separately. </p> <p align="center"> <img src="assets/ft_coco_ade.jpg" alt="iTPN" width="70%"> </p> <p align="center"> Table 2: Visual recognition results (%) on COCO and ADE20K. Mask R-CNN (abbr. MR, 1x/3x) and Cascade Mask R-CNN (abbr. CMR, 1x) are used on COCO, and UPerHead with 512x512 input is used on ADE20K. For the base-level models, each cell of COCO results contains object detection (box) and instance segmentation (mask) APs. For the large-level models, the accuracy of 1x Mask R-CNN surpasses all existing methods. </p>

License
iTPN is released under the License.
Citation
@article{tian2024fast,
title={Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration},
author={Tian, Yunjie and Xie, Lingxi and Qiu, Jihao and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}
}
@inproceedings{tian2023integrally,
title={Integrally pre-trained transformer pyramid networks},
author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18610--18620},
year={2023}
}
@inproceedings{zhang2023hivit,
title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
booktitle={International Conference on Learning Representations},
year={2023}
}