# AlignDiT

[ACM MM 2025] AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
Official PyTorch implementation for the following paper:
<div align="center"><img width="80%" src="https://mm.kaist.ac.kr/projects/AlignDiT/imgs/fig1.png?raw=true"/></div>

**AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation**<br>
Jeongsoo Choi, Ji-Hoon Kim, Kim Sung-Bin, Tae-Hyun Oh, Joon Son Chung<br>
ACM MM 2025<br>
[Paper] [Project]
## Datasets

To help verify your setup and ensure reproducibility, we provide the following debug data.

| Path | Dataset | Debug data |
|:---:|:---:|:---:|
| data/LibriSpeech_debug | LibriSpeech | download |
| data/LRS3_debug | LRS3 | download |
## Model Checkpoints

| Path | Train Dataset | Model |
|:---:|:---:|:---:|
| ckpts/AlignDiT_pretrain_hifigan_16k_LibriSpeech_notext/model_500000.pt | LibriSpeech | download |
| ckpts/AlignDiT_finetune_hifigan_16k_LRS3_char/model_400000.pt | LRS3 | download |
## Test Samples

We provide audio samples generated by AlignDiT. For the VTS task, we use a lip reading model (Auto-AVSR) to transcribe text from the silent video before inference.

| Task | Test Dataset | WER ↓ | AVSync ↑ | spkSIM ↑ | Samples |
|:---:|:---:|:---:|:---:|:---:|:---:|
| ADR (automated dialogue replacement) | LRS3-cross | 1.401 | 0.751 | 0.515 | download |
| VTS (video-to-speech synthesis) | LRS3-cross | 19.513 | 0.688 | 0.508 | download |
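The WER column above follows the standard word error rate definition. As a rough illustration of the metric itself (a minimal sketch, not this repo's evaluation code, which follows F5-TTS), WER is the word-level edit distance normalized by the reference length:

```python
def wer(ref_words, hyp_words):
    """Word error rate: word-level Levenshtein distance / len(reference)."""
    # Dynamic-programming edit distance over words
    # (substitutions, insertions, deletions all cost 1).
    d = [[0] * (len(hyp_words) + 1) for _ in range(len(ref_words) + 1)]
    for i in range(len(ref_words) + 1):
        d[i][0] = i
    for j in range(len(hyp_words) + 1):
        d[0][j] = j
    for i in range(1, len(ref_words) + 1):
        for j in range(1, len(hyp_words) + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[-1][-1] / len(ref_words)

print(wer("the cat sat".split(), "the cat sat".split()))  # 0.0
print(wer("the cat sat".split(), "the hat sat".split()))  # ~0.333 (1 substitution / 3 words)
```

Note the table reports WER as a percentage, so 1.401 corresponds to a ratio of about 0.014.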
## 1. Installation

```shell
conda create -y -n aligndit python=3.10 && conda activate aligndit
git clone https://github.com/kaistmm/AlignDiT.git && cd AlignDiT

pip install torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 --index-url https://download.pytorch.org/whl/cu121
pip install -e .
pip install -e .[eval]  # For evaluation
```
## 2. Data Preparation

We crop the mouth region from each video following Auto-AVSR and place the resulting videos in `data/LRS3_debug/autoavsr/video`. Both training and inference use these cropped videos.
### Metadata

```shell
bash src/aligndit/run/misc/prepare_librispeech_notext.sh
bash src/aligndit/run/misc/prepare_lrs3.sh
```
### Mel spectrogram

```shell
bash src/aligndit/run/misc/extract_mel.sh
```
### HuBERT feature

```shell
bash src/aligndit/run/misc/extract_hubert.sh
```
### AV-HuBERT video feature

This requires Fairseq and AV-HuBERT.

```shell
bash src/aligndit/run/misc/extract_avhubert_from_only_video.sh
```
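The extraction scripts above produce frame-level features at different rates, so it can be useful to sanity-check sequence lengths after extraction. The rates below are the usual HuBERT/AV-HuBERT conventions (16 kHz audio with a 320-sample stride, i.e. 50 Hz features; 25 fps video), stated as assumptions rather than values read from this repo's configs:

```python
def expected_lengths(num_audio_samples, sr=16000, hubert_stride=320, video_fps=25):
    """Rough expected feature-sequence lengths for one clip.

    Assumptions (common conventions, not verified against this repo):
    HuBERT emits one frame per `hubert_stride` samples at `sr` Hz,
    and video features follow the video frame rate `video_fps`.
    """
    duration_s = num_audio_samples / sr
    hubert_frames = num_audio_samples // hubert_stride  # 50 Hz at 16 kHz
    video_frames = int(duration_s * video_fps)          # 25 fps
    return hubert_frames, video_frames

# A 2-second clip at 16 kHz: 100 HuBERT frames vs. 50 video frames.
print(expected_lengths(2 * 16000))  # (100, 50)
```

Under these assumptions there are two audio frames per video frame, which is the kind of ratio a multimodal aligner has to bridge.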
## 3. Training

```shell
# 1. Pre-train on LibriSpeech for 500k updates
bash src/aligndit/run/train/pretrain.sh

# 2. Fine-tune on LRS3 for 400k updates
bash src/aligndit/run/train/finetune.sh
```
## 4. Inference

```shell
# ADR (automated dialogue replacement)
bash src/aligndit/run/eval/infer.sh

# VTS (video-to-speech synthesis)
bash src/aligndit/run/eval/infer_w_lipreader.sh
```
## 5. Evaluation

We follow F5-TTS for evaluation. Further details are available here.

```shell
# For the AVSync metric, run this script beforehand
bash src/aligndit/run/misc/extract_avhubert.sh

bash src/aligndit/run/eval/eval_lrs3_test.sh
```
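Of the metrics reported above, spkSIM is typically computed as the cosine similarity between speaker embeddings of the generated and reference audio (the embedding model used here follows the F5-TTS evaluation setup; the exact model is not restated in this README). A minimal sketch of the similarity computation itself, with placeholder vectors standing in for real speaker embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain Python)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical embeddings score ~1.0; orthogonal embeddings score 0.0.
print(cosine_similarity([1.0, 2.0], [1.0, 2.0]))  # ~1.0 (up to float error)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```

Higher spkSIM means the generated speech better preserves the reference speaker's identity.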
## Acknowledgement

This repository is built on F5-TTS, AV-HuBERT, Fairseq, CosyVoice, HiFi-GAN, and V2SFlow. We thank the authors of these projects for open-sourcing their work.
## Citation

If our work is useful for you, please cite the following paper:
```bibtex
@inproceedings{choi2025aligndit,
  title={AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation},
  author={Choi, Jeongsoo and Kim, Ji-Hoon and Sung-Bin, Kim and Oh, Tae-Hyun and Chung, Joon Son},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  year={2025}
}
```
## License
This project is released under the MIT License. Please note that the use of AV-HuBERT models is subject to their original license terms.