TransPose
PyTorch Implementation for "TransPose: Keypoint localization via Transformer", ICCV 2021.
Introduction
TransPose is a human pose estimation model based on a CNN feature extractor, a Transformer Encoder, and a prediction head. Given an image, the attention layers built into the Transformer can efficiently capture long-range spatial relationships between keypoints and reveal which dependencies the predicted keypoint locations rely on.

[arxiv 2012.14214] [paper] [demo-notebook]
TransPose: Keypoint Localization via Transformer, Sen Yang, Zhibin Quan, Mu Nie, Wankou Yang, ICCV 2021
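To make the pipeline concrete, here is a minimal, illustrative sketch of that CNN-encoder-head structure. The module choices and layer sizes below are simplified placeholders, not the exact configurations from the paper or this repo:

```python
import torch
import torch.nn as nn

# Illustrative sketch of the TransPose pipeline:
# CNN backbone -> flatten to a token sequence -> Transformer encoder -> heatmap head.
# All sizes here are placeholders, not the paper's exact configurations.
class TransPoseSketch(nn.Module):
    def __init__(self, d_model=256, nhead=8, num_layers=4, num_joints=17):
        super().__init__()
        # Stand-in for the truncated ResNet/HRNet feature extractor.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
        )
        encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Prediction head regresses one heatmap per keypoint.
        self.head = nn.Conv2d(d_model, num_joints, kernel_size=1)

    def forward(self, x):
        feat = self.backbone(x)                 # (B, C, H, W)
        b, c, h, w = feat.shape
        seq = feat.flatten(2).permute(2, 0, 1)  # (H*W, B, C): one token per location
        seq = self.encoder(seq)                 # attention captures long-range dependencies
        feat = seq.permute(1, 2, 0).reshape(b, c, h, w)
        return self.head(feat)                  # (B, num_joints, H, W) heatmaps
```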
Model Zoo
We choose two types of CNNs as the backbone candidates: ResNet and HRNet. The derived convolutional blocks are ResNet-Small, HRNet-Small-W32, and HRNet-Small-W48.
| Model | Backbone | #Attention layers | d | h | #Heads | #Params | AP (COCO val, gt bbox) | Download |
| -------------- | ----------- | :---------------: | :--: | :--: | :----: | :-----: | :--------------------: | :------: |
| TransPose-R-A3 | ResNet-S | 3 | 256 | 1024 | 8 | 5.2M | 73.8 | model |
| TransPose-R-A4 | ResNet-S | 4 | 256 | 1024 | 8 | 6.0M | 75.1 | model |
| TransPose-H-S | HRNet-S-W32 | 4 | 64 | 128 | 1 | 8.0M | 76.1 | model |
| TransPose-H-A4 | HRNet-S-W48 | 4 | 96 | 192 | 1 | 17.3M | 77.5 | model |
| TransPose-H-A6 | HRNet-S-W48 | 6 | 96 | 192 | 1 | 17.5M | 78.1 | model |
Quick use
You can directly load the TransPose-R-A4 or TransPose-H-A4 model, with weights pretrained on the COCO train2017 dataset, from Torch Hub, simply by:
```python
import torch

tpr = torch.hub.load('yangsenius/TransPose:main', 'tpr_a4_256x192', pretrained=True)
tph = torch.hub.load('yangsenius/TransPose:main', 'tph_a4_256x192', pretrained=True)
```
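Continuing from the snippet above, a quick smoke test on a dummy 256x192 input. The expected output shape is our assumption, based on the 17 COCO keypoints and the usual quarter-resolution heatmaps of top-down pose pipelines:

```python
import torch

# Hypothetical smoke test: feed a dummy 256x192 image through the hub model.
# The 17-keypoint, quarter-resolution heatmap shape is an assumption based on
# common top-down pose pipelines, not something guaranteed by this snippet.
tpr.eval()
with torch.no_grad():
    dummy = torch.randn(1, 3, 256, 192)  # (batch, channels, height, width)
    heatmaps = tpr(dummy)
print(heatmaps.shape)                     # expected: torch.Size([1, 17, 64, 48])
```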
Results on COCO val2017 with a detector having human AP of 56.4
| Model | Input size | FPS* | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR | AR .5 | AR .75 | AR (M) | AR (L) |
| :------------: | :--------: | :--: | :----: | :---: | :---: | :----: | :----: | :----: | :---: | :---: | :----: | :----: | :----: |
| TransPose-R-A3 | 256x192 | 141 | 8.0 | 0.717 | 0.889 | 0.788 | 0.680 | 0.786 | 0.771 | 0.930 | 0.836 | 0.727 | 0.835 |
| TransPose-R-A4 | 256x192 | 138 | 8.9 | 0.726 | 0.891 | 0.799 | 0.688 | 0.798 | 0.780 | 0.931 | 0.845 | 0.735 | 0.844 |
| TransPose-H-S | 256x192 | 45 | 10.2 | 0.742 | 0.896 | 0.808 | 0.706 | 0.810 | 0.795 | 0.935 | 0.855 | 0.752 | 0.856 |
| TransPose-H-A4 | 256x192 | 41 | 17.5 | 0.753 | 0.900 | 0.818 | 0.717 | 0.821 | 0.803 | 0.939 | 0.861 | 0.761 | 0.865 |
| TransPose-H-A6 | 256x192 | 38 | 21.8 | 0.758 | 0.901 | 0.821 | 0.719 | 0.828 | 0.808 | 0.939 | 0.864 | 0.764 | 0.872 |
Note:
- We computed the average FPS* over 100 samples from the COCO val2017 dataset (batch size = 1) on a single NVIDIA 2080Ti GPU. The FPS may fluctuate across runs; see the timing sketch below.
- We trained the models on different hardware platforms: 1 x RTX 2080Ti GPU (TP-R-A4), 4 x Titan Xp GPUs (TP-H-S, TP-H-A4), and 4 x Tesla P40 GPUs (TP-H-A6).
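For reference, a minimal sketch of how such a single-image FPS number can be measured. The warm-up count and timing loop are our assumptions, not the exact benchmarking script used for the table:

```python
import time
import torch

# Illustrative FPS measurement: average latency over single-image forward
# passes on a CUDA GPU. Warm-up and sample counts are assumptions.
def measure_fps(model, input_size=(1, 3, 256, 192), n_samples=100, device='cuda'):
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):            # warm-up to exclude CUDA init costs
            model(x)
        torch.cuda.synchronize()       # finish warm-up kernels before timing
        start = time.perf_counter()
        for _ in range(n_samples):
            model(x)
        torch.cuda.synchronize()       # wait for all kernels before stopping the clock
    return n_samples / (time.perf_counter() - start)
```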
Results on COCO test-dev2017 with a detector having human AP of 60.9
| Model | Input size | #Params | GFLOPs | AP | AP .5 | AP .75 | AP (M) | AP (L) | AR | AR .5 | AR .75 | AR (M) | AR (L) |
| -------------- | :--------: | :-----: | :----: | :---: | :---: | :----: | :----: | :----: | :---: | :---: | :----: | :----: | :----: |
| TransPose-H-S | 256x192 | 8.0M | 10.2 | 0.734 | 0.916 | 0.811 | 0.701 | 0.793 | 0.786 | 0.950 | 0.856 | 0.745 | 0.843 |
| TransPose-H-A4 | 256x192 | 17.3M | 17.5 | 0.747 | 0.919 | 0.822 | 0.714 | 0.807 | 0.799 | 0.953 | 0.866 | 0.758 | 0.854 |
| TransPose-H-A6 | 256x192 | 17.5M | 21.8 | 0.750 | 0.922 | 0.823 | 0.713 | 0.811 | 0.801 | 0.954 | 0.867 | 0.759 | 0.859 |
Visualization
Given an input image, a pretrained TransPose model, and the predicted keypoint locations, we can visualize the spatial dependencies of each predicted location, keeping only attention scores above a threshold.
TransPose-R-A4 with threshold=0.00

TransPose-R-A4 with threshold=0.01

TransPose-H-A4 with threshold=0.00

TransPose-H-A4 with threshold=0.00075

Getting started
Installation
- Clone this repository, and we'll call the directory that you cloned ${POSE_ROOT}

  ```bash
  git clone https://github.com/yangsenius/TransPose.git
  ```

- Install PyTorch>=1.6 and torchvision>=0.7 from the PyTorch official website

- Install package dependencies. Make sure the Python environment is >=3.7

  ```bash
  pip install -r requirements.txt
  ```

- Make output (training models and files) and log (TensorBoard log) directories under ${POSE_ROOT}, and make libs

  ```bash
  mkdir output log
  cd ${POSE_ROOT}/lib
  make
  ```

- Download pretrained models from the releases of this repo to the specified directory

  ```
  ${POSE_ROOT}
  `-- models
      `-- pytorch
          |-- imagenet
          |   |-- hrnet_w32-36af842e.pth
          |   |-- hrnet_w48-8ef0771d.pth
          |   `-- resnet50-19c8e357.pth
          `-- transpose_coco
              |-- tp_r_256x192_enc3_d256_h1024_mh8.pth
              |-- tp_r_256x192_enc4_d256_h1024_mh8.pth
              |-- tp_h_32_256x192_enc4_d64_h128_mh1.pth
              |-- tp_h_48_256x192_enc4_d96_h192_mh1.pth
              `-- tp_h_48_256x192_enc6_d96_h192_mh1.pth
  ```
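To sanity-check a downloaded checkpoint before wiring it into the configs, you can peek at its parameter names and shapes. This only reads the weights; building the matching model is done through the repo's config-driven tools:

```python
import torch

# Inspect a downloaded checkpoint (filename from the directory tree above).
state = torch.load(
    'models/pytorch/transpose_coco/tp_r_256x192_enc4_d256_h1024_mh8.pth',
    map_location='cpu',
)
# Some checkpoints wrap weights in a 'state_dict' key; fall back to the dict itself.
state_dict = state.get('state_dict', state)
for name, tensor in list(state_dict.items())[:5]:
    print(name, tuple(tensor.shape))
```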
Data Preparation
We follow the steps of HRNet to prepare the COCO train/val/test dataset and the annotations. The detected person results are downloaded from OneDrive or GoogleDrive. Please download or link them to ${POSE_ROOT}/data/coco/, and make them look like this:
```
${POSE_ROOT}/data/coco/
|-- annotations
|   |-- person_keypoints_train2017.json
|   `-- person_keypoints_val2017.json
|-- person_detection_results
|   |-- COCO_val2017_detections_AP_H_56_person.json
|   `-- COCO_test-dev2017_detections_AP_H_609_person.json
`-- images
    |-- train2017
    |   |-- 000000000009.jpg
    |   |-- ...
    `-- val2017
        |-- 000000000139.jpg
        |-- ...
```
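A small convenience check (ours, not part of the repo) to verify the layout before training, assuming ${POSE_ROOT} is exported as an environment variable:

```python
import os

# Convenience check: verify the expected COCO files and folders exist
# under ${POSE_ROOT}/data/coco before launching training or testing.
pose_root = os.environ.get('POSE_ROOT', '.')
expected = [
    'annotations/person_keypoints_train2017.json',
    'annotations/person_keypoints_val2017.json',
    'person_detection_results/COCO_val2017_detections_AP_H_56_person.json',
    'images/train2017',
    'images/val2017',
]
for rel in expected:
    path = os.path.join(pose_root, 'data', 'coco', rel)
    print(('ok   ' if os.path.exists(path) else 'MISS ') + path)
```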
Training & Testing
Testing on COCO val2017 dataset
```bash
python tools/test.py --cfg experiments/coco/transpose_r/TP_R_256x192_d256_h1024_enc4_mh8.yaml TEST.USE_GT_BBOX True
```
Training on COCO train2017 dataset
```bash
python tools/train.py --cfg experiments/coco/transpose_r/TP_R_256x192_d256_h1024_enc4_mh8.yaml
```
Acknowledgements
Great thanks to these papers and their open-source code: HRNet, DETR, DarkPose.
License
This repository is released under the MIT LICENSE.
Citation
If you find this repository useful, please give it a star 🌟 or consider citing our work:
```
@inproceedings{yang2021transpose,
  title={TransPose: Keypoint Localization via Transformer},
  author={Yang, Sen and Quan, Zhibin and Nie, Mu and Yang, Wankou},
  booktitle={IEEE/CVF International Conference on Computer Vision (ICCV)},
  year={2021}
}
```