# AlignDiT

[ACM MM 2025] AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
Official PyTorch implementation for the following paper:
<div align="center"><img width="80%" src="https://mm.kaist.ac.kr/projects/AlignDiT/imgs/fig1.png?raw=true"/></div>

**AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation**<br>
Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung<br>
ACM MM 2025<br>
[Paper] [Project]
## Datasets

To help verify your setup and ensure reproducibility, we provide the following debug data.

| Path | Dataset | Debug data |
|:---:|:---:|:---:|
| data/LibriSpeech_debug | LibriSpeech | download |
| data/LRS3_debug | LRS3 | download |
## Model Checkpoints

| Path | Train Dataset | Model |
|:---:|:---:|:---:|
| ckpts/AlignDiT_pretrain_hifigan_16k_LibriSpeech_notext/model_500000.pt | LibriSpeech | download |
| ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt | LRS3 | download |
## Test Samples

We provide audio samples generated by AlignDiT. For the VTS task, we use a lip reading model (Auto-AVSR) to transcribe text from the silent video before inference.

| Task | Test Dataset | WER ↓ | AVSync ↑ | spkSIM ↑ | Samples |
|:---:|:---:|:---:|:---:|:---:|:---:|
| ADR (automated dialogue replacement) | LRS3-cross | 1.401 | 0.751 | 0.515 | download |
| VTS (video-to-speech synthesis) | LRS3-cross | 19.513 | 0.688 | 0.508 | download |
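The WER column above follows the standard word error rate definition. As a rough illustration of the metric itself (a minimal sketch, not this repo's evaluation code, which follows F5-TTS), WER is the word-level edit distance normalized by the reference length:

```python
def wer(ref_words, hyp_words):
    """Word error rate: word-level Levenshtein distance / len(reference)."""
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref_words)

print(wer("the cat sat".split(), "the cat sat".split()))  # 0.0
print(wer("the cat sat".split(), "the hat sat".split()))  # ~0.333 (1 substitution / 3 words)
```

Note the table reports WER as a percentage, so 1.401 corresponds to a ratio of about 0.014.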
## 1. Installation

```shell
conda create -y -n aligndit python=3.10 && conda activate aligndit
git clone https://github.com/kaistmm/AlignDiT.git && cd AlignDiT

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
pip install -e .[eval]  # For evaluation
```
## 2. Data Preparation

We crop the mouth region from each video following Auto-AVSR and place the resulting videos in `data/LRS3_debug/autoavsr/video`. Both training and inference use these cropped videos.
### Metadata

```shell
bash src/aligndit/run/misc/prepare_librispeech_notext.sh
bash src/aligndit/run/misc/prepare_lrs3.sh
```
### Mel spectrogram

```shell
bash src/aligndit/run/misc/extract_mel.sh
```
### HuBERT feature

```shell
bash src/aligndit/run/misc/extract_hubert.sh
```
### AV-HuBERT video feature

This requires Fairseq and AV-HuBERT.

```shell
bash src/aligndit/run/misc/extract_avhubert_from_only_video.sh
```
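The extraction scripts above produce frame-level features at different rates, so it can be useful to sanity-check sequence lengths after extraction. The rates below are the usual HuBERT/AV-HuBERT conventions (16 kHz audio with a 320-sample stride, i.e. 50 Hz features; 25 fps video), stated as assumptions rather than values read from this repo's configs:

```python
def expected_lengths(num_audio_samples, sr=16000, hubert_stride=320, video_fps=25):
    """Rough expected feature-sequence lengths for one clip.

    Assumptions (common conventions, not verified against this repo):
    HuBERT emits one frame per `hubert_stride` samples at `sr` Hz,
    and video features follow the video frame rate `video_fps`.
    """
    duration_s = num_audio_samples / sr
    hubert_frames = num_audio_samples // hubert_stride  # 50 Hz at 16 kHz
    video_frames = int(duration_s * video_fps)          # 25 fps
    return hubert_frames, video_frames

# A 2-second clip at 16 kHz: 100 HuBERT frames vs. 50 video frames.
print(expected_lengths(2 * 16000))  # (100, 50)
```

Under these assumptions there are two audio frames per video frame, which is the kind of ratio a multimodal aligner has to bridge.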
## 3. Training

```shell
# 1. Pre-train on LibriSpeech for 500k updates
bash src/aligndit/run/train/pretrain.sh

# 2. Fine-tune on LRS3 for 400k updates
bash src/aligndit/run/train/finetune.sh
```
## 4. Inference

```shell
# ADR (automated dialogue replacement)
bash src/aligndit/run/eval/infer.sh

# VTS (video-to-speech synthesis)
bash src/aligndit/run/eval/infer_w_lipreader.sh
```
## 5. Evaluation

We follow F5-TTS for evaluation. Further details are available here.

```shell
# For the AVSync metric, run this script beforehand
bash src/aligndit/run/misc/extract_avhubert.sh

bash src/aligndit/run/eval/eval_lrs3_test.sh
```
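Of the metrics reported above, spkSIM is typically computed as the cosine similarity between speaker embeddings of the generated and reference audio (the embedding model used here follows the F5-TTS evaluation setup; the exact model is not restated in this README). A minimal sketch of the similarity computation itself, with placeholder vectors standing in for real speaker embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical embeddings score ~1.0; orthogonal embeddings score 0.0.
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # ~1.0 (up to float error)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Higher spkSIM means the generated speech better preserves the reference speaker's identity.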
## Acknowledgement

This repository is built on F5-TTS, AV-HuBERT, Fairseq, CosyVoice, HiFi-GAN, and V2SFlow. We thank the authors of these projects for open-sourcing their work.
## Citation

If our work is useful for you, please cite the following paper:
```bibtex
@inproceedings{choi2025aligndit,
  title={AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Sung-Bin, Kim and Oh, Tae-Hyun and Chung, Joon Son},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}
```
## License
This project is released under the MIT License. Please note that the use of AV-HuBERT models is subject to their original license terms.