# EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer
🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, it combines high-quality audio synthesis with low computational demands.
🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: EzAudio Space
🎮 EzAudio-ControlNet Demo is available: EzAudio-ControlNet Space
<!-- We want to thank Hugging Face Spaces and Gradio for providing an incredible demo platform. -->

## News
<!-- - [SynSonic](https://github.com/JHU-LCAP/SynSonic), leveraging EzAudio and ControlNet for sound event detection (SED) data augmentation, was accepted to WASPAA 2025. -->

- 2025.05: EzAudio has been accepted for an oral presentation at Interspeech 2025.
## Installation

Clone the repository:

```bash
git clone git@github.com:haidog-yaqub/EzAudio.git
```

Install the dependencies:

```bash
cd EzAudio
pip install -r requirements.txt
```

Download checkpoints (optional): https://huggingface.co/OpenSound/EzAudio
## Usage

You can use the model with the following code:

```python
import torch
import soundfile as sf

from api.ezaudio import EzAudio

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'egs/edit_example.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
```
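If a generated waveform peaks above ±1.0, writing it to disk can clip; a simple peak-normalization pass (plain NumPy, independent of the EzAudio API — a minimal sketch, not part of the library) avoids this before calling `sf.write`:

```python
import numpy as np

def peak_normalize(audio: np.ndarray, peak: float = 0.95) -> np.ndarray:
    """Scale the waveform so its largest sample magnitude equals `peak`."""
    max_amp = np.max(np.abs(audio))
    if max_amp == 0:
        return audio  # silence: nothing to scale
    return audio * (peak / max_amp)

# example with a synthetic, deliberately too-loud sine wave
t = np.linspace(0, 1, 24000, endpoint=False)
loud = 1.7 * np.sin(2 * np.pi * 440 * t)
safe = peak_normalize(loud)
print(round(float(np.max(np.abs(safe))), 2))  # -> 0.95
```

A peak of 0.95 rather than 1.0 leaves a small margin of headroom for lossy re-encoding.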
ControlNet usage:

```python
import torch
import soundfile as sf

from api.ezaudio import EzAudio_ControlNet

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
controlnet = EzAudio_ControlNet(model_name='energy', device=device)

prompt = 'dog barking'
# path to the reference audio
audio_path = 'egs/reference.mp3'
sr, audio = controlnet.generate_audio(prompt, audio_path=audio_path)
sf.write(f"{prompt}_control.wav", audio, samplerate=sr)
```
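The `energy` ControlNet conditions generation on the loudness contour of the reference clip. As a rough illustration of what such a control signal looks like, here is a frame-wise RMS envelope computed with plain NumPy (the features EzAudio extracts internally may differ; this is only a sketch):

```python
import numpy as np

def rms_envelope(audio: np.ndarray, frame_len: int = 1024, hop: int = 512) -> np.ndarray:
    """Frame-wise root-mean-square energy of a mono waveform."""
    frames = []
    for start in range(0, len(audio) - frame_len + 1, hop):
        frame = audio[start:start + frame_len]
        frames.append(np.sqrt(np.mean(frame ** 2)))
    return np.array(frames)

# a 1-second full-scale 440 Hz tone at 16 kHz: RMS sits near 1/sqrt(2)
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
env = rms_envelope(tone)
print(round(float(env.mean()), 2))  # -> 0.71
```

A loud bark followed by silence would show as a sharp peak and decay in this envelope, which is the kind of temporal shape the ControlNet transfers to the generated audio.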
## Training

### Autoencoder

Refer to the VAE training section of our work SoloAudio.

### T2A Diffusion Model

Prepare your data (see the example in src/dataset/meta_example.csv), then run:

```bash
cd src
accelerate launch train.py
```
## Todo

- [x] Release Gradio demo along with checkpoints: EzAudio Space
- [x] Release ControlNet demo along with checkpoints: EzAudio ControlNet Space
- [x] Release inference code
- [x] Release training pipeline and dataset
- [x] Improve the API and support automatic checkpoint downloading
## Reference

If you find the code useful for your research, please consider citing:

```bibtex
@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}
```
## Acknowledgement

Some code is borrowed from or inspired by: U-ViT, PixArt-α, Hunyuan-DiT, and Stable Audio.