[TNNLS 2025] TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition

This is an official PyTorch implementation of "TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition".

📝 Paper: Journal Version | arXiv Version

Introduction

TransXNet is a CNN-Transformer hybrid vision backbone that can model both global and local dynamics with a Dual Dynamic Token Mixer (D-Mixer), achieving superior performance over both CNN and Transformer-based models.

Image Classification

1. Requirements

We highly suggest using our provided dependencies to ensure reproducibility:

# Environments:
cuda==11.6
python==3.8.15
# Packages:
mmcv==1.7.1
timm==0.6.12
torch==1.13.1
torchvision==0.14.1

2. Data Preparation

Prepare ImageNet with the following folder structure, you can extract ImageNet by this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

3. Main Results on ImageNet with Pretrained Models

| Models | Input Size | FLOPs (G) | Params (M) | Top-1 Acc.(%) | Download | |:-----------:|:----------:|:---------:|:----------:|:----------:|:----------:| | TransXNet-T | 224x224 | 1.8 | 12.8 | 81.6 | model | | TransXNet-S | 224x224 | 4.5 | 26.9 | 83.8 | model | | TransXNet-B | 224x224 | 8.3 | 48.0 | 84.6 | model | | TransXNet-B | 384x384 | 24.2 | 48.0 | 85.5 | model |

4. Train

To train TransXNet models on ImageNet-1K with 8 gpus (single node), run:

bash scripts/train_tiny.sh  # train TransXNet-T
bash scripts/train_small.sh # train TransXNet-S
bash scripts/train_base.sh  # train TransXNet-B

5. Validation

To evaluate TransXNet on ImageNet-1K, run:

MODEL=transxnet_t # transxnet_{t, s, b}
python3 validate.py \
/path/to/imagenet \
--model $MODEL -b 128 \
--pretrained # or --checkpoint /path/to/checkpoint

Object Detection and Semantic Segmentation

Object Detection
Semantic Segmentation

Citation

If you find this project useful for your research, please consider citing:

@article{lou2023transxnet,
  title={TransXNet: Learning Both Global and Local Dynamics with a Dual Dynamic Token Mixer for Visual Recognition},
  author={Meng Lou and Shu Zhang and Hong-Yu Zhou and Sibei Yang and Chuan Wu and Yizhou Yu},
  journal={IEEE Transactions on Neural Networks and Learning Systems},
  year={2025}
}

Acknowledgment

Our implementation is mainly based on the following codebases. We gratefully thank the authors for their wonderful works.

poolformer
mmdetection
mmsegmentation
pytorch-image-models

Contact

If you have any questions, please feel free to create issues or contact me at lmzmm.0921@gmail.com.

TransXNet

Install / Use

README