TaDiCodec
This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling [NeurIPS 2025]" https://arxiv.org/pdf/2508.16790
🎵 Diffusion-Speech-Tokenizer 🚀
<img src="https://img.shields.io/badge/🔥-TaDiCodec-red?style=for-the-badge" alt="TaDiCodec"/> <img src="https://img.shields.io/badge/🎯-Text--aware-blue?style=for-the-badge" alt="Text-aware"/> <img src="https://img.shields.io/badge/🌊-Diffusion-purple?style=for-the-badge" alt="Diffusion"/> <img src="https://img.shields.io/badge/🗣️-Speech-green?style=for-the-badge" alt="Speech"/>🔬 Official PyTorch Implementation of TaDiCodec
📄 Paper: TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling [NeurIPS 2025] (https://arxiv.org/pdf/2508.16790)
</div>

📋 Overview
This repository provides implementations for our series of works on diffusion-based speech tokenizers. It currently centers on TaDiCodec, with additional in-progress works to be added as they are released. Specifically, the repository includes:
- 🧠 A simple PyTorch implementation of the TaDiCodec tokenizer
- 🎯 Token-based zero-shot TTS models based on TaDiCodec:
- 🤖 Autoregressive (AR) TTS models
- 🌊 Masked diffusion, a.k.a. Masked Generative Model (MGM), TTS models
- 🏋️ Training scripts for tokenizer and TTS models
- 🤗 Hugging Face and 🔮 ModelScope (to be updated) for easy access to pre-trained models
Short intro to TaDiCodec (Text-aware Diffusion Speech Tokenizer for Speech Language Modeling):
We introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach to speech tokenization that employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).
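To make the compression figures concrete, here is a short sanity check of the arithmetic; the implied codebook size is an inference from the reported 6.25 Hz and 0.0875 kbps numbers, not a value quoted above:

```python
# Reported figures: 6.25 tokens/s at 0.0875 kbps with a single-layer codebook
frame_rate_hz = 6.25
bitrate_bps = 0.0875 * 1000  # 87.5 bits per second

# Bits carried by each token = bitrate / frame rate
bits_per_token = bitrate_bps / frame_rate_hz
print(bits_per_token)  # 14.0

# Implied codebook size (inferred, not quoted from the paper)
codebook_size = 2 ** round(bits_per_token)
print(codebook_size)  # 16384

# At this rate, a 10-second utterance is only ~62 tokens for the LM to model
print(frame_rate_hz * 10)  # 62.5
```

For comparison, a 50 Hz tokenizer would need roughly eight times as many tokens for the same utterance, which is why the low frame rate matters for speech language modeling.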
<!-- Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. -->

📢 News & Updates
<div align="center">🔥 Latest Updates 🔥
</div>

- 🎉 [2025-09-19] TaDiCodec was accepted to NeurIPS 2025!
- 🚀 [2025-08-25] We released the official implementation of TaDiCodec and the TTS models based on TaDiCodec.
- 🔥 [2025-08-25] TaDiCodec paper released! Check out our arXiv preprint
- 📦 [2025-08-25] Added auto-download functionality from Hugging Face for all models!
🚧 Development Roadmap & TODO List
<div align="center">🔥 Current Status: Active Development 🔥
This project is under active development. Check back frequently for updates!
</div>

🎯 Core TaDiCodec Implementation
- [x] 🏗️ Repository Structure Setup
- [x] 📝 Documentation Framework
- [x] 🧠 TaDiCodec Model Architecture
- [x] NAR Llama-style transformers for encoder and decoder architectures
- [x] Text-aware flow matching (diffusion) decoder
- [x] Vocoder for mel-to-wav reconstruction
- [x] ⚡ Inference Pipeline
- [x] Basic inference pipeline
- [x] Auto-download from Hugging Face
- [ ] Add auto-ASR for text input
🎓 Training Infrastructure
- [ ] 🏋️ TaDiCodec Training Scripts
- [ ] 💾 Dataset and Dataloader
🎤 Text-to-Speech Models
- [ ] 🤖 Autoregressive Models
- [x] Model architecture
- [x] Pre-training models loading and inference
- [ ] Training scripts
- [ ] 🌊 Masked Diffusion Models
- [x] Model architecture
- [x] Pre-training models loading and inference
- [ ] Training scripts
📊 Evaluation
- [ ] Add evaluation scripts
🪐 Future Works
- [ ] 🛸 Diffusion-based Speech Tokenizer without text conditioning
🤗 Pre-trained Models
📦 Model Zoo - Ready to Use!
Download our pre-trained models for instant inference
🎵 TaDiCodec
| Model | 🤗 Hugging Face | 👷 Status |
|:-----:|:---------------:|:------:|
| 🚀 TaDiCodec | | ✅ |
| 🚀 TaDiCodec-old | | 🚧 |
Note: TaDiCodec-old is the previous version of TaDiCodec; TaDiCodec-TTS-AR-Phi-3.5-4B is based on TaDiCodec-old.
🎤 TTS Models
| Model | Type | LLM | 🤗 Hugging Face | 👷 Status |
|:-----:|:----:|:---:|:---------------:|:-------------:|
| 🤖 TaDiCodec-TTS-AR-Qwen2.5-0.5B | AR | Qwen2.5-0.5B-Instruct | | ✅ |
| 🤖 TaDiCodec-TTS-AR-Qwen2.5-3B | AR | Qwen2.5-3B-Instruct | | ✅ |
| 🤖 TaDiCodec-TTS-AR-Phi-3.5-4B | AR | Phi-3.5-mini-instruct | | 🚧 |
| 🌊 TaDiCodec-TTS-MGM | MGM | - | | ✅ |
- [ ] ModelScope will be updated soon.
🔧 Quick Model Usage
```python
# 🤗 Load from Hugging Face with auto-download
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

# Load the TaDiCodec tokenizer (auto-downloads from HF if not found locally)
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")

# Load the AR TTS model (auto-downloads from HF if not found locally)
tts_model = TTSInferencePipeline.from_pretrained(
    tadicodec_path="amphion/TaDiCodec",
    llm_path="amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B",
)

# Load the MGM TTS model (auto-downloads from HF if not found locally)
mgm_model = MGMInferencePipeline.from_pretrained(
    tadicodec_path="amphion/TaDiCodec",
    mgm_path="amphion/TaDiCodec-TTS-MGM-0.6B",
)

# You can also use local paths if the models are already downloaded
# tts_model = TTSInferencePipeline.from_pretrained(
#     tadicodec_path="./ckpt/TaDiCodec",
#     llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-0.5B",
# )
```
🚀 Quick Start
Installation
Conda Linux
Select one of the two PyTorch install commands depending on your hardware (CUDA or CPU-only).
```bash
# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet

# PyTorch
# CUDA
pip install torch==2.8.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio

pip install flash_attn==2.7.4.post1
pip install -r requirements.txt
```
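After the install, a quick sanity check that the key packages resolve (a minimal sketch; it only verifies that the packages installed above are importable, not that the repository runs end to end):

```python
import importlib.util

# Report whether each core dependency from the steps above is importable;
# find_spec only locates the module, it does not import or initialize it.
for pkg in ("torch", "torchaudio", "flash_attn"):
    status = "ok" if importlib.util.find_spec(pkg) else "MISSING"
    print(f"{pkg}: {status}")
```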
Conda Windows
This assumes you are using PowerShell. Select one of the two PyTorch install commands depending on your hardware, and one of the two flash_attn options depending on whether you want a pre-built wheel or to compile your own.
```powershell
# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet

# PyTorch
# CUDA
pip install torch==2.8.0 torchaudio --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio

# flash_attn
# Option 1: use a pre-built wheel
pip install https://huggingface.co/kim512/flash_attn-2.7.4.post1/resolve/main/flash_attn-2.7.4.post1-cu128-torch2.8.0-cp310-cp310-win_amd64.whl
# OR Option 2: compile your own. Set MAX_JOBS to match your CPU, ideally 4 to 8;
# if you have limited RAM, use a smaller number.
$Env:MAX_JOBS="6"
$Env:CUDA_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"
pip install -v flash-attn==2.7.4.post1 --no-build-isolation

# Install remaining requirements
pip install -r requirements.txt
```
UV Linux
Select one of the two PyTorch install commands depending on your hardware (CUDA or CPU-only).
```bash
# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install Python and dependencies
uv python install 3.10
uv venv --python 3.10
uv pip install setuptools wheel psutil packaging ninja numpy hf_xet

# pytorch
#
```