TaDiCodec
This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling [NeurIPS 2025]" https://arxiv.org/pdf/2508.16790
🎵 Diffusion-Speech-Tokenizer 🚀
<img src="https://img.shields.io/badge/🔥-TaDiCodec-red?style=for-the-badge" alt="TaDiCodec"/> <img src="https://img.shields.io/badge/🎯-Text--aware-blue?style=for-the-badge" alt="Text-aware"/> <img src="https://img.shields.io/badge/🌊-Diffusion-purple?style=for-the-badge" alt="Diffusion"/> <img src="https://img.shields.io/badge/🗣️-Speech-green?style=for-the-badge" alt="Speech"/>🔬 Official PyTorch Implementation of TaDiCodec
📄 Paper: TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling [NeurIPS 2025] (https://arxiv.org/pdf/2508.16790)
</div>

📋 Overview
This repository provides implementations for our series of works on diffusion-based speech tokenizers. It currently centers on TaDiCodec, with additional in-progress works to be added as they are released. Specifically, the repository includes:
- 🧠 A simple PyTorch implementation of the TaDiCodec tokenizer
- 🎯 Token-based zero-shot TTS models based on TaDiCodec:
- 🤖 Autoregressive (AR) TTS models
- 🌊 Masked diffusion, a.k.a. Masked Generative Model (MGM), TTS models
- 🏋️ Training scripts for tokenizer and TTS models
- 🤗 Hugging Face and 🔮 ModelScope (to be updated) for easy access to pre-trained models
Short intro to TaDiCodec (Text-aware Diffusion Speech Tokenizer for Speech Language Modeling):
We introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach to speech tokenization that employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).
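To make the compression figures concrete, here is a short sanity check of the arithmetic; the implied codebook size is an inference from the reported 6.25 Hz and 0.0875 kbps numbers, not a value quoted above:

```python
# Reported figures: 6.25 tokens/s at 0.0875 kbps with a single-layer codebook
frame_rate_hz = 6.25
bitrate_bps = 0.0875 * 1000  # 87.5 bits per second

# Bits carried by each token = bitrate / frame rate
bits_per_token = bitrate_bps / frame_rate_hz
print(bits_per_token)  # 14.0

# Implied codebook size (inferred, not quoted from the paper)
codebook_size = 2 ** round(bits_per_token)
print(codebook_size)  # 16384

# At this rate, a 10-second utterance is only ~62 tokens for the LM to model
print(frame_rate_hz * 10)  # 62.5
```

For comparison, a 50 Hz tokenizer would need roughly eight times as many tokens for the same utterance, which is why the low frame rate matters for speech language modeling.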
<!-- Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. -->

📢 News & Updates
<div align="center">🔥 Latest Updates 🔥
</div>

- 🎉 [2025-09-19] TaDiCodec was accepted to NeurIPS 2025!
- 🚀 [2025-08-25] We released the official implementation of TaDiCodec and the TTS models based on TaDiCodec.
- 🔥 [2025-08-25] TaDiCodec paper released! Check out our arXiv preprint
- 📦 [2025-08-25] Added auto-download functionality from Hugging Face for all models!
🚧 Development Roadmap & TODO List
<div align="center">🔥 Current Status: Active Development 🔥
This project is under active development. Check back frequently for updates!
</div>

🎯 Core TaDiCodec Implementation
- [x] 🏗️ Repository Structure Setup
- [x] 📝 Documentation Framework
- [x] 🧠 TaDiCodec Model Architecture
- [x] NAR Llama-style transformers for encoder and decoder architectures
- [x] Text-aware flow matching (diffusion) decoder
- [x] Vocoder for mel-to-wav reconstruction
- [x] ⚡ Inference Pipeline
- [x] Basic inference pipeline
- [x] Auto-download from Hugging Face
- [ ] Add auto-ASR for text input
🎓 Training Infrastructure
- [ ] 🏋️ TaDiCodec Training Scripts
- [ ] 💾 Dataset and Dataloader
🎤 Text-to-Speech Models
- [ ] 🤖 Autoregressive Models
- [x] Model architecture
- [x] Pre-training models loading and inference
- [ ] Training scripts
- [ ] 🌊 Masked Diffusion Models
- [x] Model architecture
- [x] Pre-training models loading and inference
- [ ] Training scripts
📊 Evaluation
- [ ] Add evaluation scripts
🪐 Future Works
- [ ] 🛸 Diffusion-based Speech Tokenizer without text conditioning
🤗 Pre-trained Models
📦 Model Zoo - Ready to Use!
Download our pre-trained models for instant inference
🎵 TaDiCodec
| Model | 🤗 Hugging Face | 👷 Status |
|:-----:|:---------------:|:------:|
| 🚀 TaDiCodec | | ✅ |
| 🚀 TaDiCodec-old | | 🚧 |
Note: TaDiCodec-old is the previous version of TaDiCodec; TaDiCodec-TTS-AR-Phi-3.5-4B is based on TaDiCodec-old.
🎤 TTS Models
| Model | Type | LLM | 🤗 Hugging Face | 👷 Status |
|:-----:|:----:|:---:|:---------------:|:-------------:|
| 🤖 TaDiCodec-TTS-AR-Qwen2.5-0.5B | AR | Qwen2.5-0.5B-Instruct | | ✅ |
| 🤖 TaDiCodec-TTS-AR-Qwen2.5-3B | AR | Qwen2.5-3B-Instruct | | ✅ |
| 🤖 TaDiCodec-TTS-AR-Phi-3.5-4B | AR | Phi-3.5-mini-instruct | | 🚧 |
| 🌊 TaDiCodec-TTS-MGM | MGM | - | | ✅ |
- [ ] ModelScope will be updated soon.
🔧 Quick Model Usage
```python
# 🤗 Load from Hugging Face with auto-download
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

# Load the TaDiCodec tokenizer (auto-downloads from HF if not found locally)
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")

# Load the AR TTS model (auto-downloads from HF if not found locally)
tts_model = TTSInferencePipeline.from_pretrained(
    tadicodec_path="amphion/TaDiCodec",
    llm_path="amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B",
)

# Load the MGM TTS model (auto-downloads from HF if not found locally)
mgm_model = MGMInferencePipeline.from_pretrained(
    tadicodec_path="amphion/TaDiCodec",
    mgm_path="amphion/TaDiCodec-TTS-MGM-0.6B",
)

# You can also use local paths if the models are already downloaded
# tts_model = TTSInferencePipeline.from_pretrained(
#     tadicodec_path="./ckpt/TaDiCodec",
#     llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-0.5B",
# )
```
🚀 Quick Start
Installation
Conda Linux
Select one of the two PyTorch install commands depending on your hardware (CUDA or CPU-only).
```bash
# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet

# PyTorch
# CUDA
pip install torch==2.8.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio

pip install flash_attn==2.7.4.post1
pip install -r requirements.txt
```
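After the install, a quick sanity check that the key packages resolve (a minimal sketch; it only verifies that the packages installed above are importable, not that the repository runs end to end):

```python
import importlib.util

# Report whether each core dependency from the steps above is importable;
# find_spec only locates the module, it does not import or initialize it.
for pkg in ("torch", "torchaudio", "flash_attn"):
    status = "ok" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```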
Conda Windows
This assumes you are using PowerShell. Select one of the two PyTorch install commands depending on your hardware, and one of the two flash_attn options depending on whether you want a pre-built wheel or to compile your own.
```powershell
# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet

# PyTorch
# CUDA
pip install torch==2.8.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio

# flash_attn
# Option 1: use a pre-built wheel
pip install https://huggingface.co/kim512/flash_attn-2.7.4.post1/resolve/main/flash_attn-2.7.4.post1-cu128-torch2.8.0-cp310-cp310-win_amd64.whl
# OR Option 2: compile your own. Set MAX_JOBS to match your CPU, ideally 4 to 8;
# if you have limited RAM, use a smaller number.
$Env:MAX_JOBS="6"
$Env:CUDA_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"
pip install -v flash-attn==2.7.4.post1 --no-build-isolation

# Install remaining requirements
pip install -r requirements.txt
```
UV Linux
Select one of the two PyTorch install commands depending on your hardware (CUDA or CPU-only).
```bash
# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install Python and dependencies
uv python install 3.10
uv venv --python 3.10
uv pip install setuptools wheel psutil packaging ninja numpy hf_xet

# pytorch
#
```