<img src="arts/ezaudio.png">

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

Official Page arXiv Hugging Face Models

🟣 EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, it combines high-quality audio synthesis with low computational cost.

🎛 Play with EzAudio for text-to-audio generation, editing, and inpainting: EzAudio Space

🎮 EzAudio-ControlNet Demo is available: EzAudio-ControlNet Space

<!-- We want to thank Hugging Face Space and Gradio for providing incredible demo platform. -->

News

<!-- - [SynSonic](https://github.com/JHU-LCAP/SynSonic ), leveraging EzAudio and ControlNet for sound event detection (SED) data augmentation, was accepted to WASPAA 2025. -->
  • 2025.05 EzAudio has been accepted for an oral presentation at Interspeech 2025.

Installation

Clone the repository:

git clone git@github.com:haidog-yaqub/EzAudio.git

Install the dependencies:

cd EzAudio
pip install -r requirements.txt

Download checkpoints (optional): https://huggingface.co/OpenSound/EzAudio

Usage

You can use the model with the following code:

from api.ezaudio import EzAudio
import torch
import soundfile as sf

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
ezaudio = EzAudio(model_name='s3_xl', device=device)

# text-to-audio generation
prompt = "a dog barking in the distance"
sr, audio = ezaudio.generate_audio(prompt)
sf.write(f'{prompt}.wav', audio, sr)

# audio inpainting
prompt = "A train passes by, blowing its horns"
original_audio = 'egs/edit_example.wav'
sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio,
                                  mask_start=1, mask_length=5)
sf.write(f'{prompt}_edit.wav', audio, sr)
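Note that writing output to `f'{prompt}.wav'` can fail when a prompt contains characters that are invalid in filenames (for example `/` or `:`). A small sanitizer helps; this is a suggestion, not part of the EzAudio API:

```python
import re

def safe_filename(prompt: str, ext: str = ".wav") -> str:
    """Turn a free-text prompt into a filesystem-safe filename."""
    # Collapse anything other than letters, digits, underscores,
    # and dashes into a single underscore, then trim and cap length.
    name = re.sub(r"[^\w\-]+", "_", prompt.strip())
    return name.strip("_")[:100] + ext

# e.g. sf.write(safe_filename(prompt), audio, sr)
```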

ControlNet Usage:


from api.ezaudio import EzAudio_ControlNet
import torch
import soundfile as sf

# load model
device = 'cuda' if torch.cuda.is_available() else 'cpu'
controlnet = EzAudio_ControlNet(model_name='energy', device=device)

prompt = 'dog barking'
# path for audio reference
audio_path = 'egs/reference.mp3'

sr, audio = controlnet.generate_audio(prompt, audio_path=audio_path)
sf.write(f"{prompt}_control.wav", audio, samplerate=sr)
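ControlNet conditions on a reference recording, so the reference should match the sample rate the model expects (an assumption; check the model config for the actual rate). If your reference is at a different rate, here is a minimal linear-interpolation resampling sketch; for production quality, prefer a polyphase resampler such as `scipy.signal.resample_poly`:

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a mono signal with linear interpolation (quick sketch only)."""
    if orig_sr == target_sr:
        return audio
    duration = len(audio) / orig_sr
    n_out = int(round(duration * target_sr))
    old_t = np.arange(len(audio)) / orig_sr   # original sample times (s)
    new_t = np.arange(n_out) / target_sr      # target sample times (s)
    return np.interp(new_t, old_t, audio)

# e.g. audio_24k = resample_linear(audio_48k, 48000, 24000)
```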

Training

Autoencoder

Refer to the VAE training section in our work SoloAudio.

T2A Diffusion Model

Prepare your data (see example in src/dataset/meta_example.csv), then run:

cd src
accelerate launch train.py

Todo

  • [x] Release Gradio Demo along with checkpoints EzAudio Space
  • [x] Release ControlNet Demo along with checkpoints EzAudio ControlNet Space
  • [x] Release inference code
  • [x] Release training pipeline and dataset
  • [x] Improve API and support automatic ckpts downloading

Reference

If you find the code useful for your research, please consider citing:

@article{hai2024ezaudio,
  title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer},
  author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong},
  journal={arXiv preprint arXiv:2409.10819},
  year={2024}
}

Acknowledgement

Some codes are borrowed from or inspired by: U-Vit, Pixel-Art, Huyuan-DiT, and Stable Audio.
