
<div align="center"> <img src='https://github.com/user-attachments/assets/3fdf69a7-e2db-4c61-aad0-109e6ccc51fa' width='600px'/>

Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait

Paper License GitHub Stars

<div> <a href='https://chaolongy.github.io/' target='_blank'>Chaolong Yang <sup>1,3*</sup> </a>&emsp; <a href='https://kaiseem.github.io/' target='_blank'>Kai Yao <sup>2*</sup></a>&emsp; <a href='https://scholar.xjtlu.edu.cn/en/persons/YuyaoYan' target='_blank'>Yuyao Yan <sup>3</sup> </a>&emsp; <a href='https://scholar.google.com/citations?hl=zh-CN&user=HDO58yUAAAAJ' target='_blank'>Chenru Jiang <sup>4</sup> </a>&emsp; <a href='https://weiguangzhao.github.io/' target='_blank'>Weiguang Zhao <sup>1,3</sup> </a>&emsp; </br> <a href='https://scholar.google.com/citations?hl=zh-CN&user=c-x5M2QAAAAJ' target='_blank'>Jie Sun <sup>3†</sup> </a>&emsp; <a href='https://sites.google.com/view/guangliangcheng' target='_blank'>Guangliang Cheng <sup>1</sup> </a>&emsp; <a href='https://scholar.google.com/schhp?hl=zh-CN' target='_blank'>Yifei Zhang <sup>5</sup> </a>&emsp; <a href='https://scholar.google.com/citations?hl=zh-CN&user=JNRMVNYAAAAJ&view_op=list_works&sortby=pubdate' target='_blank'>Bin Dong <sup>4</sup> </a>&emsp; <a href='https://sites.google.com/view/kaizhu-huang-homepage/home' target='_blank'>Kaizhu Huang <sup>4†</sup> </a>&emsp; </div> <br> <div> <sup>1</sup> University of Liverpool &emsp; <sup>2</sup> Ant Group &emsp; <sup>3</sup> Xi’an Jiaotong-Liverpool University &emsp; </br> <sup>4</sup> Duke Kunshan University &emsp; <sup>5</sup> Ricoh Software Research Center &emsp; </div> <div align="justify">

News

[2025.09.03] Our paper was accepted by the International Journal of Computer Vision (IJCV).

[2025.07.30] Training and evaluation codes have been released.

[2025.07.03] Our demo KDTalker++ was accepted by the 2025 ACM Multimedia Demo and Video Track.

[2025.05.26] Important update! New models and new features have been added to the local KDTalker deployment, including background replacement and expression editing.

[2025.04.13] A more powerful TTS model has been added to our local KDTalker deployment.

[2025.03.14] Released the paper-version demo and inference code.

Comparative videos

https://github.com/user-attachments/assets/08ebc6e0-41c5-4bf4-8ee8-2f7d317d92cd

Demo

Locally deployed demo (on an RTX 4090): KDTalker.

You can also try the demo hosted on Hugging Face, where inference is slower due to ZeroGPU.

<img width="2789" height="1553" alt="Demo" src="https://github.com/user-attachments/assets/387c9cab-4d79-48b2-96d7-f9271fe9f1d6" />

Environment

KDTalker can run on a single RTX 4090 or RTX 3090.

1. Clone the code and prepare the environment

Note: Make sure your system has git, conda, and FFmpeg installed.

git clone https://github.com/chaolongy/KDTalker
cd KDTalker

# create env using conda
conda create -n KDTalker python=3.9
conda activate KDTalker

conda install pytorch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 pytorch-cuda=11.8 -c pytorch -c nvidia

pip install -r requirements.txt

2. Download pretrained weights

First, download all LivePortrait pretrained weights from Google Drive. Unzip and place them in ./pretrained_weights. Ensure the directory structure is as follows:

pretrained_weights
├── insightface
│   └── models
│       └── buffalo_l
│           ├── 2d106det.onnx
│           └── det_10g.onnx
└── liveportrait
    ├── base_models
    │   ├── appearance_feature_extractor.pth
    │   ├── motion_extractor.pth
    │   ├── spade_generator.pth
    │   └── warping_module.pth
    ├── landmark.onnx
    └── retargeting_models
        └── stitching_retargeting_module.pth

You can download the weights for the face detector, audio extractor and KDTalker from Google Drive. Put them in ./ckpts.

Alternatively, you can download all of the above weights from Hugging Face.
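After downloading, a quick sanity check of the directory layout can save debugging time later. A minimal sketch (the file list mirrors the tree above; `check_weights` is an illustrative helper, not part of the repo):

```python
import os

# Expected files relative to the pretrained_weights root (from the tree above)
EXPECTED = [
    "insightface/models/buffalo_l/2d106det.onnx",
    "insightface/models/buffalo_l/det_10g.onnx",
    "liveportrait/base_models/appearance_feature_extractor.pth",
    "liveportrait/base_models/motion_extractor.pth",
    "liveportrait/base_models/spade_generator.pth",
    "liveportrait/base_models/warping_module.pth",
    "liveportrait/landmark.onnx",
    "liveportrait/retargeting_models/stitching_retargeting_module.pth",
]

def check_weights(root="./pretrained_weights"):
    """Return the expected weight files that are missing under `root`."""
    return [p for p in EXPECTED if not os.path.isfile(os.path.join(root, p))]

for p in check_weights():
    print("missing:", p)
```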

Training

1. Data processing

python ./dataset_process/extract_motion_dataset.py -mp4_root ./path_to_your_video_root

2. Calculate data norm

python ./dataset_process/cal_norm.py
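cal_norm.py presumably computes per-dimension statistics used to normalize the extracted motion features before diffusion training. The underlying idea can be sketched as a generic z-score normalization (the actual script's feature layout and aggregation may differ):

```python
import statistics

def compute_norm(samples):
    """Per-dimension mean and population std over a list of feature vectors."""
    dims = list(zip(*samples))  # transpose: one tuple of values per dimension
    means = [statistics.fmean(d) for d in dims]
    stds = [statistics.pstdev(d) or 1.0 for d in dims]  # guard against zero std
    return means, stds

def normalize(vec, means, stds):
    """Z-score one feature vector with the precomputed statistics."""
    return [(x - m) / s for x, m, s in zip(vec, means, stds)]

# Toy example: three 2-dimensional motion vectors
samples = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
means, stds = compute_norm(samples)
```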

3. Configure wandb and train

Please configure your own "WANDB_API_KEY" in ./config/structured.py, then run ./main.py:

python main.py

Inference

python inference.py -source_image ./example/source_image/WDA_BenCardin1_000.png -driven_audio ./example/driven_audio/WDA_BenCardin1_000.wav -output ./results/output.mp4
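For multiple image/audio pairs, the single-file command above can be wrapped in a small driver script. A sketch using `subprocess` (the `inference.py` flags are taken from the command above; the helper names and batch logic are illustrative):

```python
import subprocess
from pathlib import Path

def build_cmd(image, audio, output):
    """Assemble the inference.py invocation for one image/audio pair."""
    return [
        "python", "inference.py",
        "-source_image", str(image),
        "-driven_audio", str(audio),
        "-output", str(output),
    ]

def run_batch(pairs, out_dir="./results"):
    """Run inference sequentially for (image, audio) pairs, one mp4 per pair."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for image, audio in pairs:
        out = Path(out_dir) / (Path(image).stem + ".mp4")
        subprocess.run(build_cmd(image, audio, out), check=True)
```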

Evaluation

1. Diversity

First, please download the Hopenet pretrained weights from Google Drive. Place them in ./evaluation/deep-head-pose/, then run ./evaluation/deep-head-pose/test_on_video_dlib.py:

python test_on_video_dlib.py -video ./path_to_your_video_root

Finally, calculate the standard deviation:

python cal_std.py

2. Beat align

python cal_beat_align_score.py -video_root ./path_to_your_video_root
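Beat-align measures how well motion beats coincide with audio beats. A common formulation (used in Bailando, which this script presumably follows; the kernel width σ is an assumed parameter) scores each audio beat by a Gaussian kernel of its distance to the nearest motion beat:

```python
import math

def beat_align_score(audio_beats, motion_beats, sigma=0.1):
    """Mean Gaussian-kernel proximity of each audio beat (in seconds)
    to its nearest motion beat; 1.0 means perfectly aligned."""
    if not audio_beats or not motion_beats:
        return 0.0
    return sum(
        math.exp(-min((a - m) ** 2 for m in motion_beats) / (2 * sigma ** 2))
        for a in audio_beats
    ) / len(audio_beats)
```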

3. LSE-C and LSE-D

Please follow the setup instructions in Wav2Lip.

Contact

Our code is released under the CC-BY-NC 4.0 license and is intended solely for research purposes. If you have any questions or wish to use it for commercial purposes, please contact us at chaolong.yang@liverpool.ac.uk.

Citation

If you find this code helpful for your research, please cite:

@article{Yang2026,
  author  = {Yang, Chaolong and Yao, Kai and Yan, Yuyao and Jiang, Chenru and Zhao, Weiguang and Sun, Jie and Cheng, Guangliang and Zhang, Yifei and Dong, Bin and Huang, Kaizhu},
  title   = {Unlock Pose Diversity: Accurate and Efficient Implicit Keypoint-based Spatiotemporal Diffusion for Audio-driven Talking Portrait},
  journal = {International Journal of Computer Vision},
  year    = {2026},
  volume  = {134},
  number  = {3},
  pages   = {111},
  doi     = {10.1007/s11263-025-02695-x},
  url     = {https://doi.org/10.1007/s11263-025-02695-x},
  issn    = {1573-1405},
  date    = {2026-02-06},
}


@inproceedings{Yang2025,
  author = {Yang, Chaolong and Guo, Yinuo and Yao, Kai and Yan, Yuyao and Sun, Jie and Huang, Kaizhu},
  title = {KDTalker++: Controllable Talking Portrait Generation with Audio, Text, and Expression Editing},
  year = {2025},
  isbn = {9798400720352},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3746027.3754462},
  doi = {10.1145/3746027.3754462},
  booktitle = {Proceedings of the 33rd ACM International Conference on Multimedia},
  pages = {13486–13488},
  numpages = {3},
  keywords = {audio-driven, diffusion, talking portrait generation},
  location = {Dublin, Ireland},
  series = {MM '25}
}

Acknowledgements

We thank these works for their public code and generous help: SadTalker, LivePortrait, Wav2Lip, Face-vid2vid, deep-head-pose, Bailando, etc.

</div>

Star History

Star History Chart
