
TaDiCodec

This repository contains a series of works on diffusion-based speech tokenizers, including the official implementation of the paper: "TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling [NeurIPS 2025]" https://arxiv.org/pdf/2508.16790


<div align="center">

🎵 Diffusion-Speech-Tokenizer 🚀

<img src="https://img.shields.io/badge/🔥-TaDiCodec-red?style=for-the-badge" alt="TaDiCodec"/> <img src="https://img.shields.io/badge/🎯-Text--aware-blue?style=for-the-badge" alt="Text-aware"/> <img src="https://img.shields.io/badge/🌊-Diffusion-purple?style=for-the-badge" alt="Diffusion"/> <img src="https://img.shields.io/badge/🗣️-Speech-green?style=for-the-badge" alt="Speech"/>

🔬 Official PyTorch Implementation of TaDiCodec


📄 Paper: TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language Modeling [NeurIPS 2025]


<!-- [![ModelScope](https://img.shields.io/badge/🔮%20ModelScope-tadicodec-blue)](https://modelscope.cn/models/amphion/TaDiCodec) --> </div>

📋 Overview

This repository provides implementations for our series of diffusion-based speech tokenizer research works. It currently features TaDiCodec, with additional in-progress works to follow. Specifically, the repository includes:

  • 🧠 A simple PyTorch implementation of the TaDiCodec tokenizer
  • 🎯 Token-based zero-shot TTS models built on TaDiCodec (autoregressive and masked-diffusion variants)
  • 🏋️ Training scripts for the tokenizer and TTS models
  • 🤗 Hugging Face and 🔮 ModelScope (to be updated) checkpoints for easy access to pre-trained models

A short introduction to TaDiCodec (Text-aware Diffusion Speech Tokenizer for Speech Language Modeling):

We introduce the Text-aware Diffusion Transformer Speech Codec (TaDiCodec), a novel approach to speech tokenization that employs end-to-end optimization for quantization and reconstruction through a diffusion autoencoder, while integrating text guidance into the diffusion decoder to enhance reconstruction quality and achieve optimal compression. TaDiCodec achieves an extremely low frame rate of 6.25 Hz and a corresponding bitrate of 0.0875 kbps with a single-layer codebook for 24 kHz speech, while maintaining superior performance on critical speech generation evaluation metrics such as Word Error Rate (WER), speaker similarity (SIM), and speech quality (UTMOS).

<!-- Notably, TaDiCodec employs a single-stage, end-to-end training paradigm, obviating the need for auxiliary pre-trained models. -->
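The frame rate and bitrate quoted above are mutually consistent with 14 bits per token: at 6.25 tokens per second, 14 bits per token gives 87.5 bit/s = 0.0875 kbps. A minimal sketch of that arithmetic (the 16384-entry codebook size is inferred from these figures, not stated in this README):

```python
# Figures stated above; the codebook size is an inference from them (assumption).
frame_rate_hz = 6.25    # tokens per second for 24 kHz speech
bitrate_kbps = 0.0875   # single-layer codebook

bits_per_token = bitrate_kbps * 1000 / frame_rate_hz
codebook_size = 2 ** round(bits_per_token)

print(bits_per_token)   # 14.0
print(codebook_size)    # 16384
```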

📢 News & Updates

<div align="center">

🔥 Latest Updates 🔥

</div>
  • 🎉 [2025-09-19] TaDiCodec is accepted by NeurIPS 2025!
  • 🚀 [2025-08-25] We release the official implementation of TaDiCodec and the TTS models based on TaDiCodec.
  • 🔥 [2025-08-25] TaDiCodec paper released! Check out our arXiv preprint
  • 📦 [2025-08-25] Added auto-download functionality from Hugging Face for all models!

🚧 Development Roadmap & TODO List

<div align="center">

🔥 Current Status: Active Development 🔥

This project is under active development. Check back frequently for updates!

</div>

🎯 Core TaDiCodec Implementation

  • [x] 🏗️ Repository Structure Setup
  • [x] 📝 Documentation Framework
  • [x] 🧠 TaDiCodec Model Architecture
    • [x] NAR Llama-style transformers for encoder and decoder architectures
    • [x] text-aware flow matching (diffusion) decoder
    • [x] vocoder for mel2wav
  • [x] ⚡ Inference Pipeline
    • [x] Basic inference pipeline
    • [x] Auto-download from Hugging Face
    • [ ] Add auto-ASR for text input

🎓 Training Infrastructure

  • [ ] 🏋️ TaDiCodec Training Scripts
  • [ ] 💾 Dataset and Dataloader

🎤 Text-to-Speech Models

  • [ ] 🤖 Autoregressive Models
    • [x] Model architecture
    • [x] Pre-training models loading and inference
    • [ ] Training scripts
  • [ ] 🌊 Masked Diffusion Models
    • [x] Model architecture
    • [x] Pre-training models loading and inference
    • [ ] Training scripts

📊 Evaluation

  • [ ] Add evaluation scripts

🪐 Future Works

  • [ ] 🛸 Diffusion-based Speech Tokenizer without text conditioning

🤗 Pre-trained Models

📦 Model Zoo - Ready to Use!

Download our pre-trained models for instant inference

🎵 TaDiCodec

| Model | 🤗 Hugging Face | 👷 Status |
|:-----:|:---------------:|:---------:|
| 🚀 TaDiCodec | HF | ✅ |
| 🚀 TaDiCodec-old | HF | 🚧 |

Note: TaDiCodec-old is the previous version of TaDiCodec; TaDiCodec-TTS-AR-Phi-3.5-4B is based on TaDiCodec-old.

🎤 TTS Models

| Model | Type | LLM | 🤗 Hugging Face | 👷 Status |
|:-----:|:----:|:---:|:---------------:|:---------:|
| 🤖 TaDiCodec-TTS-AR-Qwen2.5-0.5B | AR | Qwen2.5-0.5B-Instruct | HF | ✅ |
| 🤖 TaDiCodec-TTS-AR-Qwen2.5-3B | AR | Qwen2.5-3B-Instruct | HF | ✅ |
| 🤖 TaDiCodec-TTS-AR-Phi-3.5-4B | AR | Phi-3.5-mini-instruct | HF | 🚧 |
| 🌊 TaDiCodec-TTS-MGM | MGM | - | HF | ✅ |

  • [ ] ModelScope will be updated soon.

🔧 Quick Model Usage

# 🤗 Load from Hugging Face with Auto-Download
from models.tts.tadicodec.inference_tadicodec import TaDiCodecPipline
from models.tts.llm_tts.inference_llm_tts import TTSInferencePipeline
from models.tts.llm_tts.inference_mgm_tts import MGMInferencePipeline

# Load TaDiCodec tokenizer (auto-downloads from HF if not found locally)
tokenizer = TaDiCodecPipline.from_pretrained("amphion/TaDiCodec")

# Load AR TTS model (auto-downloads from HF if not found locally)
tts_model = TTSInferencePipeline.from_pretrained(
    tadicodec_path="amphion/TaDiCodec",
    llm_path="amphion/TaDiCodec-TTS-AR-Qwen2.5-0.5B"
)

# Load MGM TTS model (auto-downloads from HF if not found locally)
mgm_model = MGMInferencePipeline.from_pretrained(
    tadicodec_path="amphion/TaDiCodec",
    mgm_path="amphion/TaDiCodec-TTS-MGM-0.6B"
)

# You can also use local paths if you have models downloaded
# tts_model = TTSInferencePipeline.from_pretrained(
#     tadicodec_path="./ckpt/TaDiCodec",
#     llm_path="./ckpt/TaDiCodec-TTS-AR-Qwen2.5-0.5B"
# )

🚀 Quick Start

Installation

Conda Linux

Select one of the two PyTorch install lines below depending on your hardware.

# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet

# pytorch
# CUDA
pip install torch==2.8.0 torchaudio --index-strategy unsafe-best-match --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio

pip install flash_attn==2.7.4.post1
pip install -r requirements.txt

Conda Windows

This assumes you are using PowerShell. Select one of the two PyTorch install lines depending on your hardware, and choose one of the two flash_attn options: installing a pre-built wheel or compiling your own.

# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install dependencies
conda create -n tadicodec python=3.10
conda activate tadicodec
pip install setuptools wheel psutil packaging ninja numpy hf_xet

# pytorch
# CUDA
pip install torch==2.8.0 torchaudio --index-strategy unsafe-best-match --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
pip install torch==2.8.0 torchaudio

# flash_attn
# use a pre-built wheel
pip install https://huggingface.co/kim512/flash_attn-2.7.4.post1/resolve/main/flash_attn-2.7.4.post1-cu128-torch2.8.0-cp310-cp310-win_amd64.whl
# OR compile your own. Set MAX_JOBS to match your CPU, ideally 4 to 8; if you have limited RAM, make this number smaller.
$Env:MAX_JOBS="6"
$Env:CUDA_PATH="C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.8"
pip install -v flash-attn==2.7.4.post1 --no-build-isolation


# install requirements
pip install -r requirements.txt
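The MAX_JOBS value above trades compile time against RAM, since each parallel flash-attn compile job can use several gigabytes. A rough stdlib-only sketch for picking a starting value (the half-the-cores cap is a heuristic assumption, not guidance from the repository):

```python
import os

# Heuristic (assumption): flash-attn compile jobs are RAM-hungry, so cap
# parallelism well below the core count and within the suggested 4-8 range.
cores = os.cpu_count() or 4
max_jobs = max(1, min(8, cores // 2))

# Emit the PowerShell line used in the instructions above.
print(f'$Env:MAX_JOBS="{max_jobs}"')
```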

UV Linux

Select one of the two PyTorch install lines below depending on your hardware.

# Clone the repository
git clone https://github.com/AmphionTeam/Diffusion-Speech-Tokenizer.git
cd Diffusion-Speech-Tokenizer

# Install python and dependencies
uv python install 3.10
uv venv --python 3.10
uv pip install setuptools wheel psutil packaging ninja numpy hf_xet

# pytorch
# CUDA
uv pip install torch==2.8.0 torchaudio --index-strategy unsafe-best-match --extra-index-url https://download.pytorch.org/whl/cu128
# OR CPU only
uv pip install torch==2.8.0 torchaudio

No findings