AAR
[Official Implementation] Acoustic Autoregressive Modeling
Efficient Autoregressive Audio Modeling via Next-Scale Prediction
<div align="center"> </div>
<p align="center" style="font-size: larger;">
  <a href="https://arxiv.org/pdf/2408.09027">Efficient Autoregressive Audio Modeling via Next-Scale Prediction</a>
</p>
<p align="center">
  <img src="assets/pipeline.png" width=95%>
</p>

Updates
- (2024.08.24) Demo released; tokenizers for other datasets will be available in two weeks.
- (2024.08.22) Added SAT and AAR code; a demo will be released soon.
- (2024.08.20) Repo created. Code and checkpoints will be released this week.
Installation
- Install all packages via
pip3 install -r requirements.txt
Dataset
We download AudioSet from https://research.google.com/audioset/ and organize it as follows:

AudioSet
├── audioset_unbalanced_train_mp3
├── unbalanced_train_segments.csv
└── audioset_eval_raw_mp3
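The segments CSV follows AudioSet's standard layout: comment lines starting with `#`, then rows of `YTID, start_seconds, end_seconds, "positive_labels"` (the label field is quoted because it contains commas). A minimal stdlib parser, as a sketch (the function name and return shape are ours, not part of this repo):

```python
import csv
import io

def parse_audioset_csv(fileobj):
    """Parse an AudioSet segments CSV (e.g. unbalanced_train_segments.csv).

    Skips '#' header lines; returns (ytid, start, end, [labels]) tuples.
    """
    rows = []
    # skipinitialspace lets the quoted label field be parsed after ", ".
    reader = csv.reader(fileobj, skipinitialspace=True)
    for row in reader:
        if not row or row[0].startswith("#"):
            continue
        ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
        rows.append((ytid, start, end, labels.split(",")))
    return rows

sample = io.StringIO(
    '# Segments csv\n'
    '# YTID, start_seconds, end_seconds, positive_labels\n'
    '--PJHxphWEs, 30.000, 40.000, "/m/09x0r,/t/dd00088"\n'
)
segments = parse_audioset_csv(sample)
print(segments[0][0])  # --PJHxphWEs
```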
Scale-level audio tokenizer (SAT)
We are currently training a large-scale SAT for music, audio, and speech. We expect the checkpoint to be ready and released in September.
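SAT tokenizes the latent sequence coarse to fine, with each scale quantizing the residual left by the previous ones. A toy numpy sketch of that idea (the average-pool downsampling, scale schedule, and all names here are our assumptions for illustration, not the released SAT code):

```python
import numpy as np

def quantize_next_scale(z, scale_lens, codebook):
    """Toy scale-level residual quantization (illustrative sketch only).

    z:          (T, d) latent sequence from an encoder
    scale_lens: tokens per scale, coarse to fine; each must divide T
    codebook:   (K, d) shared codebook
    Returns per-scale index arrays and the cumulative reconstruction.
    """
    T, d = z.shape
    residual = z.copy()
    recon = np.zeros_like(z)
    indices = []
    for L in scale_lens:
        # Downsample the current residual to L tokens by average pooling.
        pooled = residual.reshape(L, T // L, d).mean(axis=1)
        # Nearest-neighbor codebook lookup per token.
        dists = ((pooled[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        indices.append(idx)
        # Upsample the quantized tokens back to length T; next scale sees
        # only what is still unexplained.
        up = np.repeat(codebook[idx], T // L, axis=0)
        recon += up
        residual -= up
    return indices, recon

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
codebook = rng.normal(size=(16, 4))
indices, recon = quantize_next_scale(z, [1, 2, 4, 8], codebook)
```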
Training
python3 train_SAT_mpi.py --config config/train/SAT.yaml --train_dir /path/to/audioset_unbalanced_train_mp3 --train_csv /path/to/csv --batch_size $bs --gpus $gpus --output_dir /path/to/save/ckpt --use_prefetcher True --resume latest
Inference
python3 inference_SAT.py --config config/inference/SAT.yaml --resume /path/to/ckpt.pth --test_dir /path/to/audioset_eval_raw_mp3 --batch_size $bs
Pre-trained model
We provide the AudioSet pre-trained SAT checkpoints as follows:

| model | # Scale | # Tokens | latent_dim | FAD  | HF weights |
|:-----:|:-------:|:--------:|:----------:|:----:|:----------:|
| SAT   | 16      | 455      | 64         | 1.09 | SAT.pth    |
| SAT   | 16      | 455      | 128        | 1.40 | SAT.pth    |
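The FAD column reports Fréchet Audio Distance: the Fréchet distance between two Gaussians fitted to embeddings of real and generated audio (the embeddings typically come from a pretrained classifier such as VGGish; that stage is not shown here). Given the fitted means and covariances, the distance itself is:

```python
import numpy as np

def frechet_distance(mu1, cov1, mu2, cov2):
    """||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2})."""
    def sqrtm_psd(a):
        # Symmetric PSD matrix square root via eigendecomposition.
        vals, vecs = np.linalg.eigh(a)
        return (vecs * np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T
    s1 = sqrtm_psd(cov1)
    # Tr((cov1 cov2)^{1/2}) == Tr((s1 cov2 s1)^{1/2}) for PSD matrices.
    cross = sqrtm_psd(s1 @ cov2 @ s1)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1 + cov2 - 2.0 * cross))

mu, cov = np.zeros(3), np.eye(3)
d_same = frechet_distance(mu, cov, mu, cov)          # identical Gaussians -> 0
d_shift = frechet_distance(mu, cov, mu + 1.0, cov)   # unit mean shift in 3-d -> 3
```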
Acoustic AutoRegressive Modeling (AAR)
Training
python3 train_AAR_mpi.py --config config/train/AAR.yaml --train_dir /path/to/audioset_unbalanced_train_mp3 --train_csv /path/to/csv --batch_size $bs --gpus $gpus --output_dir /path/to/save/ckpt --use_prefetcher True --resume latest --vqvae_pretrained_path /path/to/vae/ckpt --latent_dim $latent --dimension $dim
Inference
python3 inference_AAR.py --config config/inference/AAR.yaml --aar_pretrained_path /path/to/aar.pth --vqvae_pretrained_path /path/to/vqvae.pth --test_dir /path/to/audioset_eval_raw_mp3 --batch_size $bs --output_dir /path/to/save
Pre-trained model
We provide the AudioSet pre-trained AAR checkpoint as follows:

| model | # Scale | # Tokens | latent_dim | FAD  | HF weights |
|:-----:|:-------:|:--------:|:----------:|:----:|:----------:|
| SAT   | 16      | 455      | 128        | 1.40 | SAT.pth    |
| AAR   | 16      | 455      | 128        | 6.01 | AAR.pth    |
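The efficiency of next-scale prediction comes from emitting one whole scale per autoregressive step: with the table's 16 scales covering 455 tokens, generation needs 16 steps rather than 455. A toy loop illustrating the step count (the uniform sampler stands in for the trained AAR prior, and the scale schedule and codebook size are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_next_scale_generate(scale_lens, vocab=1024):
    """One 'forward pass' per scale; all tokens of a scale sampled jointly."""
    history = []
    for L in scale_lens:
        # A real model would condition on `history` here; this placeholder
        # just samples L token ids uniformly from a codebook of size `vocab`.
        next_scale = rng.integers(0, vocab, size=L)
        history.append(next_scale)
    return history

scales = toy_next_scale_generate([1, 2, 4, 8, 16])  # toy 5-scale schedule
n_steps = len(scales)                    # 5 autoregressive steps...
n_tokens = sum(len(s) for s in scales)   # ...for 31 tokens total
```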
Citation
@misc{qiu2024efficient,
title={Efficient Autoregressive Audio Modeling via Next-Scale Prediction},
author={Kai Qiu and Xiang Li and Hao Chen and Jie Sun and Jinglu Wang and Zhe Lin and Marios Savvides and Bhiksha Raj},
year={2024},
eprint={2408.09027},
archivePrefix={arXiv},
primaryClass={cs.SD}
}
