
<div align="center"> <img src='https://github.com/user-attachments/assets/3fdf69a7-e2db-4c61-aad0-109e6ccc51fa' width='600px'/>

Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

Paper License GitHub Stars

<div> <a href='https://chaolongy.github.io/' target='_blank'>Chaolong Yang <sup>1,3*</sup> </a>&emsp; <a href='https://kaiseem.github.io/' target='_blank'>Kai Yao <sup>2*</sup></a>&emsp; <a href='https://scholar.xjtlu.edu.cn/en/persons/YuyaoYan' target='_blank'>Yuyao Yan <sup>3</sup> </a>&emsp; <a href='https://scholar.google.com/citations?hl=zh-CN&user=HDO58yUAAAAJ' target='_blank'>Chenru Jiang <sup>4</sup> </a>&emsp; <a href='https://weiguangzhao.github.io/' target='_blank'>Weiguang Zhao <sup>1,3</sup> </a>&emsp; </br> <a href='https://scholar.google.com/citations?hl=zh-CN&user=c-x5M2QAAAAJ' target='_blank'>Jie Sun <sup>3†</sup> </a>&emsp; <a href='https://sites.google.com/view/guangliangcheng' target='_blank'>Guangliang Cheng <sup>1</sup> </a>&emsp; <a href='https://scholar.google.com/schhp?hl=zh-CN' target='_blank'>Yifei Zhang <sup>5</sup> </a>&emsp; <a href='https://scholar.google.com/citations?hl=zh-CN&user=JNRMVNYAAAAJ&view_op=list_works&sortby=pubdate' target='_blank'>Bin Dong <sup>4</sup> </a>&emsp; <a href='https://sites.google.com/view/kaizhu-huang-homepage/home' target='_blank'>Kaizhu Huang <sup>4†</sup> </a>&emsp; </div> <br> <div> <sup>1</sup> University of Liverpool &emsp; <sup>2</sup> Ant Group &emsp; <sup>3</sup> Xi’an Jiaotong-Liverpool University &emsp; </br> <sup>4</sup> Duke Kunshan University &emsp; <sup>5</sup> Ricoh Software Research Center &emsp; </div> <div align="justify">

News

[2025.09.03] Our paper was accepted by the International Journal of Computer Vision (IJCV).

[2025.07.30] Training and evaluation codes have been released.

[2025.07.03] Our demo KDTalker++ was accepted by the 2025 ACM Multimedia Demo and Video Track.

[2025.05.26] Important update! New models and new features have been added to the local KDTalker deployment, including background replacement and expression editing.

[2025.04.13] A more powerful TTS model has been added to our local KDTalker deployment.

[2025.03.14] Released the paper-version demo and inference code.

Comparative videos

https://github.com/user-attachments/assets/08ebc6e0-41c5-4bf4-8ee8-2f7d317d92cd

Demo

Locally deployed demo (on an RTX 4090): KDTalker.

You can also try the demo hosted on Hugging Face, where inference is slower due to ZeroGPU.

<img width="2789" height="1553" alt="Demo" src="https://github.com/user-attachments/assets/387c9cab-4d79-48b2-96d7-f9271fe9f1d6" />

Environment

KDTalker can run on a single RTX 4090 or RTX 3090.

1. Clone the code and prepare the environment

Note: Make sure your system has git, conda, and FFmpeg installed.

git clone https://github.com/chaolongy/KDTalker
cd KDTalker

# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt

2. Download pretrained weights

First, download all LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights. Ensure the directory structure is as follows:

pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth

You can download the weights for the face detector, audio extractor and KDTalker from Google Drive. Put them in ./ckpts.

Alternatively, you can download all of the above weights from Hugging Face.
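After downloading, a quick sanity check of the directory layout can save debugging time later. A minimal sketch (the file list mirrors the tree above; `check_weights` is an illustrative helper, not part of the repo):

```python
import os

# Expected files relative to the pretrained_weights root (from the tree above)
EXPECTED = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]

def check_weights(root="./pretrained_weights"):
    """Return the expected weight files that are missing under `root`."""
    return [p for p in EXPECTED if not os.path.isfile(os.path.join(root, p))]

for p in check_weights():
    print("missing:", p)
```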

Training

1. Data processing

python ./dataset_process/extract_motion_dataset.py -mp4_root ./path_to_your_video_root

2. Calculate data norm

python ./dataset_process/cal_norm.py
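cal_norm.py presumably computes per-dimension statistics used to normalize the extracted motion features before diffusion training. The underlying idea can be sketched as a generic z-score normalization (the actual script's feature layout and aggregation may differ):

```python
import statistics

def compute_norm(samples):
    """Per-dimension mean and population std over a list of feature vectors."""
    dims = list(zip(*samples))  # transpose: one tuple of values per dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1.0 for d in dims]  # guard against zero std
    return means, stds

def normalize(vec, means, stds):
    """Z-score one feature vector with the precomputed statistics."""
    return [(x - m) / s for x, m, s in zip(vec, means, stds)]

# Toy example: three 2-dimensional motion vectors
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = compute_norm(samples)
```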

3. Configure wandb and train

Please configure your own "WANDB_API_KEY" in ./config/structured.py, then run ./main.py:

python main.py

Inference

python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
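For multiple image/audio pairs, the single-file command above can be wrapped in a small driver script. A sketch using `subprocess` (the `inference.py` flags are taken from the command above; the helper names and batch logic are illustrative):

```python
import subprocess
from pathlib import Path

def build_cmd(image, audio, output):
    """Assemble the inference.py invocation for one image/audio pair."""
    return [
        "python", "inference.py",
        "-source_image", str(image),
        "-driven_audio", str(audio),
        "-output", str(output),
    ]

def run_batch(pairs, out_dir="./results"):
    """Run inference sequentially for (image, audio) pairs, one mp4 per pair."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image, audio in pairs:
        out = Path(out_dir) / (Path(image).stem + ".mp4")
        subprocess.run(build_cmd(image, audio, out), check=True)
```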

Evaluation

1. Diversity

First, please download the Hopenet pretrained weights from Google Drive. Place them in ./evaluation/deep-head-pose/, then run ./evaluation/deep-head-pose/test_on_video_dlib.py:

python test_on_video_dlib.py -video ./path_to_your_video_root

Finally, calculate the standard deviation:

python cal_std.py

2. Beat align

python cal_beat_align_score.py -video_root ./path_to_your_video_root
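Beat-align measures how well motion beats coincide with audio beats. A common formulation (used in Bailando, which this script presumably follows; the kernel width σ is an assumed parameter) scores each audio beat by a Gaussian kernel of its distance to the nearest motion beat:

```python
import math

def beat_align_score(audio_beats, motion_beats, sigma=0.1):
    """Mean Gaussian-kernel proximity of each audio beat (in seconds)
    to its nearest motion beat; 1.0 means perfectly aligned."""
    if not audio_beats or not motion_beats:
        return 0.0
    return sum(
        math.exp(-min((a - m) ** 2 for m in motion_beats) / (2 * sigma ** 2))
        for a in audio_beats
    ) / len(audio_beats)
```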

3. LSE-C and LSE-D

Please follow the setup instructions in Wav2Lip.

Contact

Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at chaolong.yang@liverpool.ac.uk.

Citation

If you find this code helpful for your research, please cite:

@article{Yang2026,
  author  = {Yang, Chaolong and Yao, Kai and Yan, Yuyao and Jiang, Chenru and Zhao, Weiguang and Sun, Jie and Cheng, Guangliang and Zhang, Yifei and Dong, Bin and Huang, Kaizhu},
  title   = {Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait},
  journal = {International Journal of Computer Vision},
  year    = {2026},
  volume  = {134},
  number  = {3},
  pages   = {111},
  doi     = {10.1007/s11263-025-02695-x},
  url     = {https://doi.org/10.1007/s11263-025-02695-x},
  issn    = {1573-1405},
  date    = {2026-02-06},
}


@inproceedings{Yang2025,
  author = {Yang, Chaolong and Guo, Yinuo and Yao, Kai and Yan, Yuyao and Sun, Jie and Huang, Kaizhu},
  title = {KDTalker++: Controllable Talking Portrait Generation with Audio, Text, and Expression Editing},
  year = {2025},
  isbn = {9798400720352},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3746027.3754462},
  doi = {10.1145/3746027.3754462},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
  pages = {13486–13488},
  numpages = {3},
  keywords = {audio-driven, diffusion, talking portrait generation},
  location = {Dublin, Ireland},
  series = {MM '25}
}

Acknowledgements

We thank these works for their public code and generous help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, deep-head-pose, Bailando, etc.

</div>

Star History

Star History Chart
