KDTalker
[IJCV 2025] Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait
<div> <a href='https://chaolongy.github.io/' target='_blank'>Chaolong Yang <sup>1,3*</sup> </a>  <a href='https://kaiseem.github.io/' target='_blank'>Kai Yao <sup>2*</a>  <a href='https://scholar.xjtlu.edu.cn/en/persons/YuyaoYan' target='_blank'>Yuyao Yan <sup>3</sup> </a>  <a href='https://scholar.google.com/citations?hl=zh-CN&user=HDO58yUAAAAJ' target='_blank'>Chenru Jiang <sup>4</sup> </a>  <a href='https://weiguangzhao.github.io/' target='_blank'>Weiguang Zhao <sup>1,3</sup> </a>  </br> <a href='https://scholar.google.com/citations?hl=zh-CN&user=c-x5M2QAAAAJ' target='_blank'>Jie Sun <sup>3†</sup> </a>  <a href='https://sites.google.com/view/guangliangcheng' target='_blank'>Guangliang Cheng <sup>1</sup> </a>  <a href='https://scholar.google.com/schhp?hl=zh-CN' target='_blank'>Yifei Zhang <sup>5</sup> </a>  <a href='https://scholar.google.com/citations?hl=zh-CN&user=JNRMVNYAAAAJ&view_op=list_works&sortby=pubdate' target='_blank'>Bin Dong <sup>4</sup> </a>  <a href='https://sites.google.com/view/kaizhu-huang-homepage/home' target='_blank'>Kaizhu Huang <sup>4†</sup> </a>  </div> <br> <div> <sup>1</sup> University of Liverpool   <sup>2</sup> Ant Group   <sup>3</sup> Xi’an Jiaotong-Liverpool University   </br> <sup>4</sup> Duke Kunshan University   <sup>5</sup> Ricoh Software Research Center   </div> <div align="justify">
News
[2025.09.03] Our paper was accepted by the International Journal of Computer Vision (IJCV).
[2025.07.30] Training and evaluation codes have been released.
[2025.07.03] Our demo KDTalker++ was accepted by the 2025 ACM Multimedia Demo and Video Track.
[2025.05.26] Important update! New models and new features have been added to the local deployment of KDTalker, including background replacement and expression editing.
[2025.04.13] A more powerful TTS has been added to our local deployment of KDTalker.
[2025.03.14] Released the paper-version demo and inference code.
Comparative videos
https://github.com/user-attachments/assets/08ebc6e0-41c5-4bf4-8ee8-2f7d317d92cd
Demo
Local deployment demo (runs on an RTX 4090): KDTalker.
You can also try the demo deployed on Hugging Face, where inference is slower due to ZeroGPU.
Environment
KDTalker can run on a single RTX 4090 or RTX 3090.
1. Clone the code and prepare the environment
Note: Make sure your system has git, conda, and FFmpeg installed.
git clone https://github.com/chaolongy/KDTalker
cd KDTalker
# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker
conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia
pip install -r requirements.txt
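After installing, you can run a quick preflight check before moving on. The snippet below is an optional convenience (not part of the repository): it verifies that FFmpeg is on your PATH and that PyTorch can see a CUDA GPU.

```python
# Optional preflight check for the environment set up above (illustrative,
# not part of the KDTalker repository).
import shutil


def check_environment():
    """Return a dict mapping each prerequisite to True/False."""
    status = {"ffmpeg": shutil.which("ffmpeg") is not None}
    try:
        import torch  # installed by the conda command above
        status["torch"] = True
        status["cuda"] = torch.cuda.is_available()
    except ImportError:
        status["torch"] = False
        status["cuda"] = False
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'OK' if ok else 'MISSING'}")
```

If `cuda` reports MISSING, double-check that your driver supports CUDA 11.8 as pinned in the conda command.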
2. Download pretrained weights
First, download all the LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights.
Ensure the directory structure is as follows:
pretrained_weights
├── insightface
│ └── models
│ └── buffalo_l
│ ├── 2d106det.onnx
│ └── det_10g.onnx
└── liveportrait
├── base_models
│ ├── appearance_feature_extractor.pth
│ ├── motion_extractor.pth
│ ├── spade_generator.pth
│ └── warping_module.pth
├── landmark.onnx
└── retargeting_models
└── stitching_retargeting_module.pth
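To confirm the unzipped weights match the layout above, a small checker like the following can help (a convenience sketch, not a repository script; the file list simply mirrors the tree shown in this README):

```python
# Hedged helper: verify the pretrained_weights layout described above.
# The expected file list mirrors the directory tree in this README.
from pathlib import Path

EXPECTED_FILES = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]


def missing_weights(root="./pretrained_weights"):
    """Return the expected weight files that are absent under `root`."""
    root = Path(root)
    return [f for f in EXPECTED_FILES if not (root / f).is_file()]


if __name__ == "__main__":
    missing = missing_weights()
    print("All weights in place." if not missing else f"Missing: {missing}")
```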
You can download the weights for the face detector, audio extractor, and KDTalker from Google Drive. Put them in ./ckpts.
Alternatively, you can download all of the above weights from Hugging Face.
Training
1. Data processing
python ./dataset_process/extract_motion_dataset.py -mp4_root ./path_to_your_video_root
2. Calculate data norm
python ./dataset_process/cal_norm.py
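Conceptually, this step computes per-dimension normalisation statistics (mean and standard deviation) over the extracted motion data. The sketch below illustrates the idea with the standard library only; it is not the repository's cal_norm.py, which operates on the actual extracted motion files.

```python
# Minimal illustration of per-dimension normalisation statistics, as
# conceptually computed in this step (not the repository's cal_norm.py).
from statistics import mean, pstdev


def norm_stats(samples):
    """samples: list of equal-length feature vectors -> (means, stds)."""
    dims = list(zip(*samples))  # transpose to per-dimension value lists
    return [mean(d) for d in dims], [pstdev(d) for d in dims]


def normalise(vec, means, stds):
    """Standardise one vector; guard against zero-variance dimensions."""
    return [(v - m) / (s or 1.0) for v, m, s in zip(vec, means, stds)]
```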
3. Configure wandb and train
Please set your own "WANDB_API_KEY" in ./config/structured.py, then run ./main.py:
python main.py
Inference
python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
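To process many image/audio pairs, the example command above can be wrapped in a small batch driver. This is a convenience sketch (not part of the repository); the flag names simply mirror the invocation shown above.

```python
# Hedged batch wrapper around the inference command shown above
# (illustrative; not part of the KDTalker repository).
import subprocess
from pathlib import Path


def inference_cmd(image, audio, output):
    """Construct the CLI argument list for inference.py."""
    return ["python", "inference.py",
            "-source_image", str(image),
            "-driven_audio", str(audio),
            "-output", str(output)]


def run_batch(pairs, out_dir="./results"):
    """Run inference for each (image, audio) pair, one output per image."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image, audio in pairs:
        out = Path(out_dir) / (Path(image).stem + ".mp4")
        subprocess.run(inference_cmd(image, audio, out), check=True)
```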
Evaluation
1. Diversity
First, download the Hopenet pretrained weights from Google Drive. Put them in ./evaluation/deep-head-pose/, then run ./evaluation/deep-head-pose/test_on_video_dlib.py:
python test_on_video_dlib.py -video ./path_to_your_video_root
Finally, calculate the standard deviation:
python cal_std.py
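The diversity score aggregates the predicted head-pose angles into a standard deviation. A minimal sketch is shown below, assuming the metric is the per-axis standard deviation averaged over yaw, pitch, and roll (an assumption on our part; see cal_std.py for the exact definition used):

```python
# Sketch of a pose-diversity score: average per-axis standard deviation of
# head-pose angles in degrees. Assumed formulation; cal_std.py is definitive.
from statistics import pstdev


def pose_diversity(yaw, pitch, roll):
    """Average per-axis standard deviation of head-pose angle sequences."""
    return (pstdev(yaw) + pstdev(pitch) + pstdev(roll)) / 3.0
```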
2. Beat align
python cal_beat_align_score.py -video_root ./path_to_your_video_root
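The beat-align score measures how closely motion beats land on audio beats. The sketch below assumes the common formulation popularised by Bailando: for each motion beat, take the distance to the nearest audio beat and average the Gaussian-weighted distances (check cal_beat_align_score.py for the exact variant and σ used):

```python
# Assumed beat-align formulation (Bailando-style); illustrative only.
import math


def beat_align_score(motion_beats, audio_beats, sigma=0.1):
    """Mean of exp(-d^2 / 2*sigma^2) over motion beats, where d is the
    distance (seconds) to the nearest audio beat; 1.0 = perfect alignment."""
    if not motion_beats or not audio_beats:
        return 0.0
    total = 0.0
    for mb in motion_beats:
        d = min(abs(mb - ab) for ab in audio_beats)
        total += math.exp(-(d ** 2) / (2 * sigma ** 2))
    return total / len(motion_beats)
```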
3. LSE-C and LSE-D
Please set it up by following the Wav2Lip instructions: Wav2lip.
Contact
Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions, or wish to use it for commercial purposes, please contact us at chaolong.yang@liverpool.ac.uk.
Citation
If you find this code helpful for your research, please cite:
@article{Yang2026,
author = {Yang, Chaolong and Yao, Kai and Yan, Yuyao and Jiang, Chenru and Zhao, Weiguang and Sun, Jie and Cheng, Guangliang and Zhang, Yifei and Dong, Bin and Huang, Kaizhu},
title = {Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait},
journal = {International Journal of Computer Vision},
year = {2026},
volume = {134},
number = {3},
pages = {111},
doi = {10.1007/s11263-025-02695-x},
url = {https://doi.org/10.1007/s11263-025-02695-x},
issn = {1573-1405},
date = {2026-02-06},
}
@inproceedings{Yang2025,
author = {Yang, Chaolong and Guo, Yinuo and Yao, Kai and Yan, Yuyao and Sun, Jie and Huang, Kaizhu},
title = {KDTalker++: Controllable Talking Portrait Generation with Audio, Text, and Expression Editing},
year = {2025},
isbn = {9798400720352},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3746027.3754462},
doi = {10.1145/3746027.3754462},
booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
pages = {13486–13488},
numpages = {3},
keywords = {audio-driven, diffusion, talking portrait generation},
location = {Dublin, Ireland},
series = {MM '25}
}
Acknowledgements
We thank these works for their public code and generous help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, deep-head-pose, Bailando, etc.
</div>