AlignDiT

Official PyTorch implementation of the following paper:

AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation<br> Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung<br> ACM MM 2025<br> [Paper] [Project]

<div align="center"><img width="80%" src="https://mm.kaist.ac.kr/projects/AlignDiT/imgs/fig1.png?raw=true"/></div>

Datasets

To help verify your setup and ensure reproducibility, we provide the following debug data.

Path | Dataset | Debug data
:---:|:---:|:---:
data/LibriSpeech_debug | LibriSpeech | download
data/LRS3_debug | LRS3 | download
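After downloading, the debug archives are expected to unpack to the paths listed in the table. A quick stdlib check of the layout (the paths come from the table; the helper function itself is ours, not part of the repo):

```python
from pathlib import Path

def missing_paths(root, expected):
    # Return the expected sub-paths that do not exist under `root`.
    return [p for p in expected if not (Path(root) / p).exists()]

# Run from the repository root; an empty list means both debug sets are in place.
print(missing_paths(".", ["data/LibriSpeech_debug", "data/LRS3_debug"]))
```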

Model Checkpoints

Path | Train Dataset | Model
:---:|:---:|:---:
ckpts/AlignDiT_pretrain_hifigan_16k_LibriSpeech_notext/model_500000.pt | LibriSpeech | download
ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt | LRS3 | download

Test Samples

We provide audio samples generated by AlignDiT. For the VTS task, we use a lip-reading model (Auto-AVSR) to transcribe text from the silent video before inference.

Task | Test Dataset | WER ↓ | AVSync ↑ | spkSIM ↑ | Samples
:---:|:---:|:---:|:---:|:---:|:---:
ADR (automated dialogue replacement) | LRS3-cross | 1.401 | 0.751 | 0.515 | download
VTS (video-to-speech synthesis) | LRS3-cross | 19.513 | 0.688 | 0.508 | download
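WER in the table is the fraction of word-level edit operations (substitutions, insertions, deletions) between the reference transcript and an ASR transcript of the generated speech. A minimal stdlib sketch of the metric — the reported numbers come from the F5-TTS evaluation pipeline, not from this helper:

```python
def wer(ref, hyp):
    # Word error rate: Levenshtein distance over word sequences,
    # normalized by the reference length.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(r)][len(h)] / max(len(r), 1)

print(round(wer("the cat sat", "the cat sit"), 3))  # 0.333 (one substitution)
```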

<br>

1. Installation

conda create -y -n aligndit python=3.10 && conda activate aligndit

git clone https://github.com/kaistmm/AlignDiT.git && cd AlignDiT

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
pip install -e .[eval]  # For evaluation
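A quick sanity check that the install step succeeded. The package names are taken from the pip commands above; the helper function is ours, not part of the repo:

```python
import importlib.util

def missing_packages(names):
    # Report which of the given top-level packages the import system cannot resolve.
    return [n for n in names if importlib.util.find_spec(n) is None]

# An empty list means the core dependencies resolved.
print(missing_packages(["torch", "torchvision", "torchaudio"]))
```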

2. Data Preparation

We crop the mouth region from each video following Auto-AVSR and place the resulting videos in data/LRS3_debug/autoavsr/video. For both training and inference, we use these cropped videos.
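The mouth cropping itself is done by the Auto-AVSR preprocessing, which locates the mouth via face landmarks. As a rough illustration of the operation, a fixed-size crop around a given mouth center (the function and its inputs are hypothetical, not the Auto-AVSR API):

```python
import numpy as np

def crop_mouth(frames, center, size=96):
    # frames: (T, H, W) grayscale video; center: (y, x) mouth-center pixel,
    # as would come from a landmark detector (hypothetical input here).
    y, x = center
    half = size // 2
    return frames[:, y - half : y + half, x - half : x + half]

video = np.zeros((25, 224, 224))  # 1 s of dummy video at 25 fps
mouth = crop_mouth(video, center=(150, 112))
print(mouth.shape)  # (25, 96, 96)
```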

Metadata

bash src/aligndit/run/misc/prepare_librispeech_notext.sh
bash src/aligndit/run/misc/prepare_lrs3.sh

Mel spectrogram

bash src/aligndit/run/misc/extract_mel.sh
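The script above handles mel extraction for the repo. For intuition, a from-scratch numpy sketch of a log-mel spectrogram (frame, window, FFT power spectrum, mel filterbank) — parameters here are illustrative, not the HiFi-GAN 16 kHz settings the repo actually uses:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    # Triangular filters spaced evenly on the mel scale, mapped to FFT bins.
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def log_mel(wav, sr=16000, n_fft=1024, hop=256, n_mels=80):
    # Short-time power spectrum, then project onto the mel filterbank.
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wav) - n_fft) // hop
    spec = np.empty((n_frames, n_fft // 2 + 1))
    for t in range(n_frames):
        frame = wav[t * hop : t * hop + n_fft] * window
        spec[t] = np.abs(np.fft.rfft(frame)) ** 2
    return np.log(spec @ mel_filterbank(sr, n_fft, n_mels).T + 1e-10)

wav = np.random.randn(16000)  # 1 s of noise at 16 kHz
m = log_mel(wav)
print(m.shape)  # (59, 80): frames x mel bins
```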

HuBERT feature

bash src/aligndit/run/misc/extract_hubert.sh

AV-HuBERT video feature

This requires Fairseq and AV-HuBERT.

bash src/aligndit/run/misc/extract_avhubert_from_only_video.sh

3. Training

# 1. Pre-train on LibriSpeech for 500k updates
bash src/aligndit/run/train/pretrain.sh

# 2. Fine-tune on LRS3 for 400k updates
bash src/aligndit/run/train/finetune.sh

4. Inference

# ADR (automated dialogue replacement)
bash src/aligndit/run/eval/infer.sh

# VTS (video-to-speech synthesis)
bash src/aligndit/run/eval/infer_w_lipreader.sh

5. Evaluation

We follow F5-TTS for evaluation. Further details are available here.

# For AVSync metric, run this script beforehand
bash src/aligndit/run/misc/extract_avhubert.sh

bash src/aligndit/run/eval/eval_lrs3_test.sh
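Of the reported metrics, spkSIM is typically the cosine similarity between speaker embeddings of the generated and reference speech; the embedder itself comes from the evaluation pipeline. The similarity step alone, as an illustrative sketch:

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Identical embeddings give ~1.0; orthogonal ones give 0.0.
print(cosine_sim([1.0, 0.0], [1.0, 0.0]))  # 1.0
```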
<br>

Acknowledgement

This repository is built on F5-TTS, AV-HuBERT, Fairseq, CosyVoice, HiFi-GAN, and V2SFlow. We thank the authors of these projects for open-sourcing their code.

Citation

If you find our work useful, please cite the following paper:

@inproceedings{choi2025aligndit,
  title={AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Sung-Bin, Kim and Oh, Tae-Hyun and Chung, Joon Son},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}

License

This project is released under the MIT License. Please note that the use of AV-HuBERT models is subject to their original license terms.
