# SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos (ECCV 2022)
This repo is the official implementation of "SmoothNet: A Plug-and-Play Network for Refining Human Poses in Videos". [Paper] [Project]
## Update

- [x] SmoothNet is supported in MMPose release v0.25.0 and in MMHuman3D as a smoothing strategy!
- [x] A clean version of the code is released!
- [x] To make SmoothNet a near-online smoothing strategy, we reduced the default window size from 64 to 32 frames!
- [x] We also provide pretrained models with window sizes of 8, 16, 32, and 64 frames here.
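With a fixed window size, sequences longer than the window are smoothed by sliding the window over time. The sketch below shows one simple inference scheme, averaging overlapping window outputs; this is an illustrative assumption, not necessarily the exact scheme used in this repo's inference code:

```python
import numpy as np

def sliding_window_smooth(seq, window, smooth_fn):
    # seq: (T, C) pose sequence; smooth_fn maps a (window, C) chunk to a
    # refined (window, C) chunk. Outputs of overlapping windows are averaged.
    T = seq.shape[0]
    out = np.zeros_like(seq, dtype=float)
    counts = np.zeros((T, 1))
    for start in range(T - window + 1):
        out[start:start + window] += smooth_fn(seq[start:start + window])
        counts[start:start + window] += 1
    return out / counts

# Toy check with an identity "smoother": the output equals the input.
seq = np.arange(20, dtype=float).reshape(10, 2)
res = sliding_window_smooth(seq, 4, lambda w: w)
```

A smaller window (e.g. 32 instead of 64) reduces the latency before a frame's final smoothed value is available, which is what makes the strategy near-online.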
It currently includes code, data, logs, and models for the following tasks:
- 2D human pose estimation
- 3D human pose estimation
- Body recovery via a SMPL model
## Major Features
- Model training and evaluation for 2D pose, 3D pose, and SMPL body representation
- Support for 6 popular datasets (AIST++, Human3.6M, Sub-JHMDB, MPI-INF-3DHP, MuPoTS-3D, 3DPW) and cleaned estimation results from 13 popular pose estimation backbones (SPIN, TCMR, VIBE, CPN, FCN, Hourglass, HRNet, RLE, VideoPose3D, TposeNet, EFT, PARE, SimplePose)
## Description
When analyzing human motion videos, the output jitters from existing pose estimators are highly unbalanced, with varied estimation errors across frames. Most frames in a video are relatively easy to estimate and suffer only from slight jitters. In contrast, for rarely seen or occluded actions, the estimated positions of multiple joints deviate largely from the ground-truth values over consecutive frames, producing significant jitters.
To tackle this problem, we propose to attach a dedicated temporal-only refinement network to existing pose estimators for jitter mitigation, named SmoothNet. Unlike existing learning-based solutions that employ spatio-temporal models to co-optimize per-frame precision and temporal smoothness at all the joints, SmoothNet models the natural smoothness characteristics in body movements by learning the long-range temporal relations of every joint without considering the noisy correlations among joints. With a simple yet effective motion-aware fully-connected network, SmoothNet improves the temporal smoothness of existing pose estimators significantly and enhances the estimation accuracy of those challenging frames as a side-effect. Moreover, as a temporal-only model, a unique advantage of SmoothNet is its strong transferability across various types of estimators and datasets. Comprehensive experiments on five datasets with eleven popular backbone networks across 2D and 3D pose estimation and body recovery tasks demonstrate the efficacy of the proposed solution. Our code and datasets are provided in the supplementary materials.
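The core idea — fully-connected layers that act only along the temporal axis, shared across all joint coordinates, with a residual connection — can be sketched in a few lines of NumPy. The weights below are random placeholders (the real model learns them), and the actual architecture uses more layers and LeakyReLU activations:

```python
import numpy as np

rng = np.random.default_rng(0)
T, C, H = 32, 17 * 3, 256   # window length, joint coords per frame, hidden size

# Placeholder weights; in SmoothNet these are learned from data.
W1 = rng.normal(0.0, 0.01, (T, H))
b1 = np.zeros(H)
W2 = rng.normal(0.0, 0.01, (H, T))
b2 = np.zeros(T)

def temporal_fc_block(x):
    # x: (C, T) -- each row is one coordinate's trajectory over the window.
    # The FC layers mix information across TIME only; channels never interact,
    # so noisy joint-to-joint correlations are ignored by construction.
    h = np.maximum(0.0, x @ W1 + b1)   # hidden layer over the time axis
    return x + (h @ W2 + b2)           # residual: refine the input, don't replace it

noisy_window = rng.normal(size=(C, T))  # stand-in for a pose estimator's output
refined = temporal_fc_block(noisy_window)
```

Because the same temporal weights are shared by every coordinate channel, the model transfers across skeletons and representations with different numbers of channels.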
## Results
SmoothNet is a plug-and-play post-processing network that smooths the outputs of any existing pose estimator. To generalize well across datasets, backbones, and modalities with lower MPJPE and PA-MPJPE, we provide THREE pretrained models (trained on AIST-VIBE-3D, 3DPW-SPIN-3D, and H36M-FCN-3D) to cover these cases.
Please refer to our supplementary materials for the detailed cross-model validation. Note that all models obtain similarly low acceleration errors (Accel) compared with the backbone estimators; the differences lie in MPJPE and PA-MPJPE.
Because SmoothNet is a temporal-only network without spatial modeling, it is trained on 3D position representations only, yet can be tested on 2D, 3D, and 6D representations.
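The two metrics reported in the tables below can be computed as follows: MPJPE is the mean Euclidean distance between predicted and ground-truth joints, and Accel compares second finite differences along time. This is the common formulation of these metrics; the exact evaluation details live in the repo's evaluation code:

```python
import numpy as np

def mpjpe(pred, gt):
    # pred, gt: (T, J, 3) joint positions, typically in millimeters.
    return np.linalg.norm(pred - gt, axis=-1).mean()

def accel_error(pred, gt):
    # Second finite difference along time approximates acceleration; the
    # metric is the mean distance between predicted and GT accelerations.
    acc_pred = pred[2:] - 2.0 * pred[1:-1] + pred[:-2]
    acc_gt = gt[2:] - 2.0 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(acc_pred - acc_gt, axis=-1).mean()

gt = np.zeros((5, 17, 3))
pred = gt + np.array([3.0, 4.0, 0.0])  # constant 5 mm offset on every joint
```

A constant offset leaves acceleration untouched, which is why a jittery estimator can have moderate MPJPE but very high Accel — exactly the imbalance SmoothNet targets.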
### 3D Keypoint Results
| Dataset | Estimator | MPJPE (Input/Output) :arrow_down: | Accel (Input/Output) :arrow_down: | Pretrain model |
| ------- | --------- | ------------------ | ------------------ | ------------ |
| AIST++ | SPIN | 107.17/95.21 | 33.19/4.17 | checkpoint / config |
| AIST++ | TCMR* | 106.72/105.51 | 6.4/4.24 | checkpoint / config |
| AIST++ | VIBE* | 106.90/97.47 | 31.64/4.15 | checkpoint / config |
| Human3.6M | FCN | 54.55/52.72 | 19.17/1.03 | checkpoint / config |
| Human3.6M | RLE | 48.87/48.27 | 7.75/0.90 | checkpoint / config |
| Human3.6M | TCMR* | 73.57/73.89 | 3.77/2.79 | checkpoint / config |
| Human3.6M | VIBE* | 78.10/77.23 | 15.81/2.86 | checkpoint / config |
| Human3.6M | Videopose (T=27)* | 50.13/50.04 | 3.53/0.88 | checkpoint / config |
| Human3.6M | Videopose (T=81)* | 48.97/48.89 | 3.06/0.87 | checkpoint / config |
| Human3.6M | Videopose (T=243)* | 48.11/48.05 | 2.82/0.87 | checkpoint / config |
| MPI-INF-3DHP | SPIN | 100.74/92.89 | 28.54/6.54 | checkpoint / config |
| MPI-INF-3DHP | TCMR* | 92.83/88.93 | 7.92/6.49 | checkpoint / config |
| MPI-INF-3DHP | VIBE* | 92.39/87.57 | 22.37/6.5 | checkpoint / config |
| MuPoTS | TposeNet* | 103.33/100.78 | 12.7/7.23 | checkpoint / config |
| MuPoTS | TposeNet+RefineNet* | 93.97/91.78 | 9.53/7.21 | checkpoint / config |
| 3DPW | EFT | 90.32/88.40 | 32.71/6.07 | checkpoint / config |
| 3DPW | EFT | 90.32/86.39 | 32.71/6.30 | checkpoint / config (additional training) |
| 3DPW | PARE | 78.91/78.11 | 25.64/5.91 | checkpoint / config |
| 3DPW | SPIN | 96.85/95.84 | 34.55/6.17 | checkpoint / config |
| 3DPW | TCMR* | 86.46/86.48 | 6.76/5.95 | checkpoint / config |
| 3DPW | VIBE* | 82.97/81.49 | 23.16/5.98 | checkpoint / config |
### 2D Keypoint Results
| Dataset | Estimator | MPJPE (Input/Output) :arrow_down: | Accel (Input/Output) :arrow_down: | Pretrain model |
| ------- | --------- | ------------------ | ------------------ | ------------ |
| Human3.6M | CPN | 6.67/6.45 | 2.91/0.14 | checkpoint / config |
