ALMTokenizer2
The open source code of ALMTokenizer2: Towards Low bit-rate and Semantic-rich Audio Tokenizer with Flow-based Scalar Diffusion Transformer Decoder
Install / Use
/learn @yangdongchao/ALMTokenizer2README
ALMTokenizer2: Towards Low bit-rate and Semantic-rich Audio Tokenizer with Flow-based Scalar Diffusion Transformer Decoder
This repository will release the source code and weight for our latest work, ALMTokenizer2. ALMTokenizer2 use the query-based quantization strategy to enhance the semantic information and reconstruction performance. Furthermore, it introudces a flow-based scalar diffusion transformer decoder to improve the reconstruction performance. Compared to ALMTokenizer, the experimental results show that ALMTokenizer2 significantly improves the reconstruction performance, especially for sound and music data.
The difference between ALMTokenizer
-
Instead of introducing multiple stage training strategy to improve the semantic information, we propose to use multiple semantic experts to extract the semantic features, then using query-based quantization for it.
-
We use diffusion loss to optimize the codec (discard the GAN-based training strategy).
-
In the released version, we do not apply the GPT loss for the codec, but the training code support to use it.
Training data
The total training data includes about 10k hours of speech, sound, and music data.
How to train the model
bash run.sh
How to infer the model
huggingface-cli download Dongchao/almtokenizer2 \
--local-dir ./Dongchao/almtokenizer2 \
--repo-type model
bash infer.sh
Performance
1. VCTK reconstruction
| Model | PESQ | STOI | MS-STFT loss | |:--------------------:|:----:|:-----:|:------------:| | ALMTokenizer (3 RVQ) | 2.0 | 0.81 | 1.78 | | ALMTokenizer (8 RVQ) | 2.63 | 0.86 | 1.57 | | MimiCodec (8RVQ) | 2.1 | 0.82 | 1.60 | | ALMTokeizer2 (8RVQ) | 2.99 | 0.86 | 1.44 |
Plan
Note that we will donot update this repo in recently. Because we have got more advanced codec (ReasoningCodec). If you take care of about the universal semantic codec, please refer to https://github.com/yangdongchao/UniAudio2
Citations
@inproceedings{yangalmtokenizer,
title={ALMTokenizer: A Low-bitrate and Semantic-rich Audio Codec Tokenizer for Audio Language Modeling},
author={Yang, Dongchao and Liu, Songxiang and Guo, Haohan and Zhao, Jiankun and Wang, Yuanyuan and Wang, Helin and Ju, Zeqian and Liu, Xubo and Chen, Xueyuan and Tan, Xu and others},
booktitle={Forty-second International Conference on Machine Learning}
}
Acknowledgement
Part of the code refers to MuCodec (https://github.com/tencent-ailab/MuCodec).
Related Skills
node-connect
350.8kDiagnose OpenClaw node connection and pairing failures for Android, iOS, and macOS companion apps
frontend-design
110.4kCreate distinctive, production-grade frontend interfaces with high design quality. Use this skill when the user asks to build web components, pages, or applications. Generates creative, polished code that avoids generic AI aesthetics.
openai-whisper-api
350.8kTranscribe audio via OpenAI Audio Transcriptions API (Whisper).
qqbot-media
350.8kQQBot 富媒体收发能力。使用 <qqmedia> 标签,系统根据文件扩展名自动识别类型(图片/语音/视频/文件)。
