Integrally Pre-Trained Transformer Pyramid Networks
Fast-iTPN: Integrally Pre-Trained Transformer Pyramid Network with Token Migration
[CVPR2023/TPAMI2024]
(A Simple Hierarchical Vision Transformer Meets Masked Image Modeling)
<p align="center"> <img src="assets/framework.png" alt="iTPN" width="90%"> </p> <p align="center"> Figure 1: The comparison between conventional pre-training (left) and the proposed integral pre-training framework (right). We use a feature pyramid as the unified neck module and apply masked feature modeling to pre-train the feature pyramid. The green and red blocks indicate network weights that are pre-trained and un-trained (i.e., randomly initialized for fine-tuning), respectively. </p>

Updates
11/Jul./2024
Fast-iTPN is accepted by TPAMI2024.
08/Jan./2024
The Fast-iTPN preprint is available on arXiv. Fast-iTPN is a more powerful version of iTPN.
26/Dec./2023
| model | Para. (M) | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K | checkpoint | checkpoint (21K) |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 224/16 | N | 85.1% | baidu/google | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 384/16 | N | 86.2% | | |
| Fast-iTPN-T | 24 | IN.1K | CLIP-L | 512/16 | N | 86.5% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 224/16 | N | 86.4% | baidu/google | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 384/16 | N | 86.95% | | |
| Fast-iTPN-S | 40 | IN.1K | CLIP-L | 512/16 | N | 87.8% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 224/16 | N | 87.4% | baidu/google | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | N | 88.5% | | |
| Fast-iTPN-B | 85 | IN.1K | CLIP-L | 512/16 | Y | 88.75% | | baidu/google |
| Fast-iTPN-L | 312 | IN.1K | CLIP-L | 640/16 | N | 89.5% | baidu/google | |
All the pre-trained Fast-iTPN models are available now (password: itpn)! To the best of our knowledge, the tiny/small/base-scale models report the best performance on ImageNet-1K. Use them for your own tasks! See Details.
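When reusing a released checkpoint in your own code, the saved keys often need light remapping before calling `load_state_dict` (e.g., stripping a pre-training wrapper prefix and dropping buffers that depend on input resolution). A minimal, dependency-free sketch of that step; the key names (`encoder.`, `relative_position_index`) are illustrative assumptions, not the exact layout of the released Fast-iTPN files:

```python
# Sketch of remapping checkpoint keys before fine-tuning.
# The key names below are illustrative assumptions, not the
# exact layout of the released checkpoints.

def remap_checkpoint(state_dict):
    """Strip a pre-training prefix and drop resolution-dependent buffers."""
    remapped = {}
    for key, value in state_dict.items():
        if key.startswith("encoder."):          # hypothetical pre-training wrapper prefix
            key = key[len("encoder."):]
        if "relative_position_index" in key:    # recomputed at the fine-tuning resolution
            continue
        remapped[key] = value
    return remapped

ckpt = {
    "encoder.patch_embed.weight": 1,
    "encoder.blocks.0.attn.qkv.weight": 2,
    "blocks.0.attn.relative_position_index": 3,
}
print(sorted(remap_checkpoint(ckpt)))
```

The filtered dict can then be passed to `model.load_state_dict(..., strict=False)` so that any remaining mismatched keys are reported rather than raising.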
30/May/2023
| model | Pre-train | teacher | input/patch | 21K ft? | Acc on IN.1K |
| :---: | :---: | :---: | :---: | :---: | :---: |
| EVA-02-B | IN.21K | EVA-CLIP-g | 196/14 | N | 87.0% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | N | 88.3% |
| EVA-02-B | IN.21K | EVA-CLIP-g | 448/14 | Y | 88.6% |
| Fast-iTPN-B | IN.1K | CLIP-L | 224/16 | N | 87.4% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | N | 88.5% |
| Fast-iTPN-B | IN.1K | CLIP-L | 512/16 | Y | 88.7% |
The Fast-iTPN models above are pre-trained only on ImageNet-1K; they will be available soon.
29/May/2023
The iTPN-L-CLIP/16 intermediate fine-tuned model is available (password: itpn): it is pre-trained on ImageNet-21K and then fine-tuned on ImageNet-1K. Evaluating it on ImageNet-1K yields 89.2% accuracy.
28/Feb./2023
iTPN is accepted by CVPR2023!
08/Feb./2023
The iTPN-L-CLIP/16 model reaches 89.2% fine-tuning performance on ImageNet-1K.
Configuration: intermediate fine-tuning on ImageNet-21K + 384 input size.
21/Jan./2023
Our HiViT is accepted by ICLR2023!
HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer
08/Dec./2022
Get checkpoints (password: abcd):

| | iTPN-B-pixel | iTPN-B-CLIP | iTPN-L-pixel | iTPN-L-CLIP/16 |
| :---: | :---: | :---: | :---: | :---: |
| baidu drive | download | download | download | download |
| google drive | download | download | download | download |
25/Nov./2022
The preprint is available on arXiv.
Requirements
- Ubuntu
- Python 3.7+
- CUDA 10.2+
- GCC 5+
- PyTorch 1.7+
Dataset
- ImageNet-1K
- COCO2017
- ADE20K
Get Started
Prepare the environment:
```shell
conda create --name itpn python=3.8 -y
conda activate itpn
git clone git@github.com:sunsmarterjie/iTPN.git
cd iTPN
pip install torch==1.7.1+cu102 torchvision==0.8.2+cu102 -f https://download.pytorch.org/whl/torch_stable.html
pip install timm==0.3.2 tensorboard einops
```
iTPN supports pre-training with either pixel or CLIP supervision. For the latter, please first download the CLIP models (we use the CLIP-B/16 and CLIP-L/14 models in the paper).
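With CLIP supervision, the pre-training objective is masked feature modeling: the student reconstructs the teacher's (CLIP) features, but only at masked token positions. A dependency-free sketch of that idea, assuming per-token scalar features and a squared-error criterion; the repo's actual loss, tensor shapes, and function names may differ:

```python
# Conceptual sketch of masked feature modeling with a teacher model:
# only masked positions contribute to the reconstruction loss.
# Names and the squared-error criterion are illustrative assumptions.

def masked_feature_loss(student_feats, teacher_feats, mask):
    """Mean squared error over masked token positions only."""
    total, count = 0.0, 0
    for s, t, m in zip(student_feats, teacher_feats, mask):
        if m:  # this token was masked during pre-training
            total += (s - t) ** 2
            count += 1
    return total / max(count, 1)

student = [0.5, 1.0, 2.0, 0.0]
teacher = [1.0, 1.0, 1.0, 1.0]
mask    = [True, False, True, False]
print(masked_feature_loss(student, teacher, mask))  # averages over the 2 masked tokens
```

Unmasked positions carry no loss, which is what lets the encoder be trained from heavily masked inputs while the teacher sees the full image.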
Main Results
<p align="center"> <img src="assets/ft_in1k.jpg" alt="iTPN" width="40%"> </p> <p align="center"> Table 1: Top-1 classification accuracy (%) by fine-tuning the pre-trained models on ImageNet-1K. We compare models of different levels and supervisions (e.g., with and without CLIP) separately. </p> <p align="center"> <img src="assets/ft_coco_ade.jpg" alt="iTPN" width="70%"> </p> <p align="center"> Table 2: Visual recognition results (%) on COCO and ADE20K. Mask R-CNN (abbr. MR, 1x/3x) and Cascade Mask R-CNN (abbr. CMR, 1x) are used on COCO, and UPerHead with 512x512 input is used on ADE20K. For the base-level models, each cell of COCO results contains object detection (box) and instance segmentation (mask) APs. For the large-level models, the accuracy of 1x Mask R-CNN surpasses all existing methods. </p>

License
iTPN is released under the License.
Citation
@article{tian2024fast,
title={Fast-iTPN: Integrally pre-trained transformer pyramid network with token migration},
author={Tian, Yunjie and Xie, Lingxi and Qiu, Jihao and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
year={2024},
publisher={IEEE}
}
@inproceedings{tian2023integrally,
title={Integrally pre-trained transformer pyramid networks},
author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
pages={18610--18620},
year={2023}
}
@inproceedings{zhang2023hivit,
title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
booktitle={International Conference on Learning Representations},
year={2023}
}