TransNeXt

[CVPR 2024] Code release for TransNeXt model

Official PyTorch implementation of "TransNeXt: Robust Foveal Visual Perception for Vision Transformers" [CVPR 2024].

🤗 Don’t hesitate to give me a ⭐️ if you are interested in this project!

Updates

2024.06.08 We have created an explanatory video for our paper. You can watch it on YouTube or BiliBili.

2024.04.20 We have released the complete training and inference code, pre-trained model weights, and training logs!

2024.02.26 Our paper has been accepted by CVPR 2024! 🎉

2023.11.28 We have submitted the preprint of our paper to arXiv.

2023.09.21 We have submitted our paper and the model code to OpenReview, where it is publicly accessible.

Current Progress

:heavy_check_mark: Release of model code and CUDA implementation for acceleration.

:heavy_check_mark: Release of comprehensive training and inference code.

:heavy_check_mark: Release of pretrained model weights and training logs.

Motivation and Highlights

Our study observes unnatural visual perception (blocky artifacts) in the effective receptive field (ERF) of many visual backbones. These artifacts, which stem from the design of the token mixers, are common in existing efficient ViTs and CNNs and are difficult to eliminate by stacking more layers. Our proposed pixel-focused attention and aggregated attention, designed through biomimicry, effectively address these artifacts and achieve natural, smooth visual perception.

Our study also reevaluates the conventional belief that CNNs adapt to multiple scales better than ViTs. We highlight two key findings:
- Existing large-kernel CNNs (RepLKNet, SLaK) exhibit drastic performance degradation in multi-scale inference.
- Our proposed TransNeXt, employing length-scaled cosine attention and extrapolative positional bias, significantly outperforms ConvNeXt in large-scale image extrapolation.
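
To make the term "length-scaled cosine attention" concrete, here is a minimal sketch. It is not the repository's code. It assumes the mechanism is cosine similarity between L2-normalized queries and keys, multiplied by a temperature `tau` times the logarithm of the number of keys, and it leaves out the positional bias entirely. The function names and the plain-Python list representation are hypothetical choices made so the example runs without PyTorch; a real implementation would use batched tensor operations.

```python
import math


def l2_normalize(vec, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / max(norm, eps) for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def length_scaled_cosine_attention(queries, keys, values, tau=1.0):
    """Attend with cosine similarity scaled by tau * log(number of keys).

    Assumption: this scaling rule is an illustrative reading of the term
    used in the README, and the positional bias is omitted.
    """
    scale = tau * math.log(len(keys))
    q_hat = [l2_normalize(q) for q in queries]
    k_hat = [l2_normalize(k) for k in keys]
    dim = len(values[0])
    outputs = []
    for q in q_hat:
        logits = [scale * sum(a * b for a, b in zip(q, k)) for k in k_hat]
        weights = softmax(logits)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return outputs
```

Because queries and keys are normalized, the logits depend only on direction, so rescaling a query leaves the output unchanged. The `log(len(keys))` factor lets the sharpness of the softmax grow with the number of keys, which is the intuition behind using it when the input resolution (and therefore the token count) changes at inference time.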

Methods

Pixel-focused attention (Left) & aggregated attention (Right):

Convolutional GLU (First on the right):

Results

Image Classification, Detection and Segmentation:

[Figure: experiment results]

Attention Visualization:

[Figure: foveal and peripheral vision]

Model Zoo

Image Classification

Classification code & weights & configs & training logs are >>>here<<<.

ImageNet-1K 224x224 pre-trained models:

| Model | #Params | #FLOPs | IN-1K | IN-A | IN-C↓ | IN-R | Sketch | IN-V2 | Download | Config | Log |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| TransNeXt-Micro | 12.8M | 2.7G | 82.5 | 29.9 | 50.8 | 45.8 | 33.0 | 72.6 | model | config | log |
| TransNeXt-Tiny | 28.2M | 5.7G | 84.0 | 39.9 | 46.5 | 49.6 | 37.6 | 73.8 | model | config | log |
| TransNeXt-Small | 49.7M | 10.3G | 84.7 | 47.1 | 43.9 | 52.5 | 39.7 | 74.8 | model | config | log |
| TransNeXt-Base | 89.7M | 18.4G | 84.8 | 50.6 | 43.5 | 53.9 | 41.4 | 75.1 | model | config | log |

ImageNet-1K 384x384 fine-tuned models:

| Model | #Params | #FLOPs | IN-1K | IN-A | IN-R | Sketch | IN-V2 | Download | Config |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| TransNeXt-Small | 49.7M | 32.1G | 86.0 | 58.3 | 56.4 | 43.2 | 76.8 | model | config |
| TransNeXt-Base | 89.7M | 56.3G | 86.2 | 61.6 | 57.7 | 44.7 | 77.0 | model | config |
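
As a rough sanity check on the #FLOPs columns, the short script below compares the reported 224x224 and 384x384 costs against the growth in pixel count. The FLOPs values are copied from the two tables above; the helper names are hypothetical, and the comparison is only an illustration, not an analysis from the paper.

```python
def pixel_ratio(src_side, dst_side):
    """Ratio of pixel counts between two square input resolutions."""
    return (dst_side / src_side) ** 2


def flops_ratios(table):
    """Map each model name to its (high-res FLOPs / low-res FLOPs) ratio."""
    return {name: high / low for name, (low, high) in table.items()}


# (GFLOPs at 224x224, GFLOPs at 384x384), taken from the tables above.
REPORTED_GFLOPS = {
    "TransNeXt-Small": (10.3, 32.1),
    "TransNeXt-Base": (18.4, 56.3),
}

if __name__ == "__main__":
    base = pixel_ratio(224, 384)
    print(f"pixel-count ratio: {base:.2f}")
    for name, ratio in flops_ratios(REPORTED_GFLOPS).items():
        print(f"{name}: FLOPs ratio {ratio:.2f}")
```

The pixel count grows by about 2.94x from 224 to 384, while the reported FLOPs grow by about 3.12x (Small) and 3.06x (Base), so the cost rises slightly faster than the number of pixels.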

ImageNet-1K 256x256 pre-trained model fully utilizing aggregated attention at all stages:

(See Table 9 in Appendix D.6 of the paper for details.)

| Model | Token mixer | #Params | #FLOPs | IN-1K | Download | Config | Log |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| TransNeXt-Micro | A-A-A-A | 13.1M | 3.3G | 82.6 | model | config | log |

Object Detection

Object detection code & weights & configs & training logs are >>>here<<<.

COCO object detection and instance segmentation results using the Mask R-CNN method:

| Backbone | Pretrained Model | Lr Schd | box mAP | mask mAP | #Params | Download | Config | Log |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| TransNeXt-Tiny | ImageNet-1K | 1x | 49.9 | 44.6 | 47.9M | model | config | log |
| TransNeXt-Small | ImageNet-1K | 1x | 51.1 | 45.5 | 69.3M | model | config | log |
| TransNeXt-Base | ImageNet-1K | 1x | 51.7 | 45.9 | 109.2M | model | config | log |

When we checked the training logs, we found that the mask mAP and other detailed metrics of Mask R-CNN with the TransNeXt-Tiny backbone were better than those reported in versions V1 and V2 of the paper. We have corrected this in version V3; the discrepancy was most likely a data-entry error.

COCO object detection results using the DINO method:

| Backbone | Pretrained Model | scales | epochs | box mAP | #Params | Download | Config | Log |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| TransNeXt-Tiny | [ImageNet-1K](https://huggingface.co/DaiShiResearch/transnext-tiny-224-1k/resolve
