| license: mit | |
| tags: | |
| - text-to-audio | |
| - controlnet | |
| pipeline_tag: text-to-audio | |
| <img src="https://github.com/haidog-yaqub/EzAudio/blob/main/arts/ezaudio.png?raw=true"> | |
| # EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer | |
| [EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer](https://huggingface.co/papers/2409.10819) | |
| **Abstract:** We introduce EzAudio, a text-to-audio (T2A) generation framework designed to produce high-quality, natural-sounding sound effects. Core designs include: (1) We propose EzAudio-DiT, an optimized Diffusion Transformer (DiT) designed for audio latent representations, improving convergence speed, as well as parameter and memory efficiency. (2) We apply a classifier-free guidance (CFG) rescaling technique to mitigate fidelity loss at higher CFG scores and enhance prompt adherence without compromising audio quality. (3) We propose a synthetic caption generation strategy leveraging recent advances in audio understanding and LLMs to enhance T2A pretraining. We show that EzAudio, with its computationally efficient architecture and fast convergence, is a competitive open-source model that excels in both objective and subjective evaluations by delivering highly realistic listening experiences. Code, data, and pre-trained models are released at: this https URL. | |
| [Project Page](https://haidog-yaqub.github.io/EzAudio-Page/) | |
| [Paper (arXiv)](https://arxiv.org/abs/2409.10819) | |
| [Demo (Hugging Face Space)](https://huggingface.co/spaces/OpenSound/EzAudio) | |
| EzAudio is a diffusion-based text-to-audio generation model. Designed for real-world audio applications, EzAudio brings together high-quality audio synthesis with lower computational demands. | |
| Play with EzAudio for text-to-audio generation, editing, and inpainting: [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio) | |
| EzAudio-ControlNet is available: [EzAudio-ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet) | |
| <!-- We want to thank Hugging Face Space and Gradio for providing incredible demo platform. --> | |
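The abstract mentions a classifier-free guidance (CFG) rescaling technique. As a rough illustration only, the sketch below shows one common formulation of that idea: apply standard CFG, then rescale the guided prediction so its standard deviation matches that of the text-conditioned prediction, and blend the two. It is not taken from the EzAudio code; the function name, the flat-list inputs, and the default values of `guidance_scale` and `rescale` are assumptions chosen for readability, and the actual implementation may differ (for example, by computing statistics per sample on tensors).

```python
from statistics import pstdev


def cfg_with_rescale(cond, uncond, guidance_scale=5.0, rescale=0.75):
    """Classifier-free guidance followed by std-matching rescaling.

    cond / uncond: flat lists of floats standing in for the model's
    predictions with and without the text prompt (hypothetical inputs).
    guidance_scale / rescale: illustrative defaults, not EzAudio's settings.
    """
    # Standard CFG: move the prediction away from the unconditional output.
    guided = [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]

    std_cond = pstdev(cond)
    std_guided = pstdev(guided)
    if std_guided == 0.0:
        # Nothing to rescale when the guided prediction is constant.
        return guided

    # Shrink the guided prediction back to the conditional prediction's scale.
    factor = std_cond / std_guided
    rescaled = [g * factor for g in guided]

    # Blend: rescale=0 gives plain CFG, rescale=1 gives the fully rescaled output.
    return [rescale * r + (1.0 - rescale) * g for r, g in zip(rescaled, guided)]
```

With `rescale=0.0` the function reduces to plain CFG, so the blend weight can be read as how strongly the over-amplified guided output is pulled back toward the scale of the conditional prediction.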
| ## Installation | |
| Clone the repository: | |
| ``` | |
| git clone git@github.com:haidog-yaqub/EzAudio.git | |
| ``` | |
| Install the dependencies: | |
| ``` | |
| cd EzAudio | |
| pip install -r requirements.txt | |
| ``` | |
| Download checkpoints (optional): | |
| [https://huggingface.co/OpenSound/EzAudio](https://huggingface.co/OpenSound/EzAudio/tree/main) | |
| ## Usage | |
| You can use the model with the following code: | |
| ```python | |
| from api.ezaudio import EzAudio | |
| import torch | |
| import soundfile as sf | |
| # load model | |
| device = 'cuda' if torch.cuda.is_available() else 'cpu' | |
| ezaudio = EzAudio(model_name='s3_xl', device=device) | |
| # text-to-audio generation | |
| prompt = "a dog barking in the distance" | |
| sr, audio = ezaudio.generate_audio(prompt) | |
| sf.write(f'{prompt}.wav', audio, sr) | |
| # audio inpainting | |
| prompt = "A train passes by, blowing its horns" | |
| original_audio = 'ref.wav' | |
| sr, audio = ezaudio.editing_audio(prompt, boundary=2, gt_file=original_audio, | |
| mask_start=1, mask_length=5) | |
| sf.write(f'{prompt}_edit.wav', audio, sr) | |
| ``` | |
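The usage example above writes files named directly after the prompt, which can fail or produce awkward names when a prompt contains slashes, punctuation, or is very long. A minimal sketch of a filename helper follows; `prompt_to_filename` is a hypothetical name and not part of the EzAudio API, and the 80-character limit is an arbitrary choice.

```python
import re


def prompt_to_filename(prompt, suffix="", max_len=80):
    """Turn a free-text prompt into a filesystem-friendly .wav filename.

    Hypothetical helper, not part of EzAudio. Non-ASCII characters are
    dropped, so a prompt with no ASCII letters or digits falls back to "audio".
    """
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", prompt.lower()).strip("_")
    # Truncate long prompts and avoid a trailing underscore after the cut.
    slug = slug[:max_len].rstrip("_") or "audio"
    return f"{slug}{suffix}.wav"
```

It could replace the f-strings in the example, e.g. `sf.write(prompt_to_filename(prompt, "_edit"), audio, sr)`.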
| ## Training | |
| #### Autoencoder | |
| Refer to the VAE training section of our related work, [SoloAudio](https://github.com/WangHelin1997/SoloAudio). | |
| #### T2A Diffusion Model | |
| Prepare your data (see example in `src/dataset/meta_example.csv`), then run: | |
| ```bash | |
| cd src | |
| accelerate launch train.py | |
| ``` | |
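Before launching a training run, it can help to check that the metadata CSV has no missing fields. The sketch below is a generic check under stated assumptions: the column names in the sample (`audio_path`, `caption`) are placeholders invented for illustration, so take the real ones from `src/dataset/meta_example.csv`; `find_incomplete_rows` is a hypothetical helper, not part of the repository.

```python
import csv
import io


def find_incomplete_rows(csv_file, required_columns):
    """Return the line numbers of rows with an empty required field.

    csv_file: an open text file (or any iterable of CSV lines).
    required_columns: column names to check; these are supplied by the
    caller because the real schema is defined by meta_example.csv.
    Line numbers assume no field spans multiple lines.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    missing = [col for col in required_columns if col not in header]
    if missing:
        raise ValueError(f"missing columns: {missing}")

    incomplete = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        if any(not (row.get(col) or "").strip() for col in required_columns):
            incomplete.append(line_no)
    return incomplete


# Usage with an in-memory sample; the column names here are placeholders.
sample = io.StringIO("audio_path,caption\na.wav,a dog barking\nb.wav,\n,rain\n")
print(find_incomplete_rows(sample, ["audio_path", "caption"]))
```

For a real file, pass an open handle instead, for example `open(path, newline="")`.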
| ## Todo | |
| - [x] Release Gradio Demo along with checkpoints [EzAudio Space](https://huggingface.co/spaces/OpenSound/EzAudio) | |
| - [x] Release ControlNet Demo along with checkpoints [EzAudio ControlNet Space](https://huggingface.co/spaces/OpenSound/EzAudio-ControlNet) | |
| - [x] Release inference code | |
| - [x] Release training pipeline and dataset | |
| - [x] Improve API and support automatic ckpts downloading | |
| - [ ] Release checkpoints for stage1 and stage2 [WIP] | |
| ## Reference | |
| If you find the code useful for your research, please consider citing: | |
| ```bibtex | |
| @article{hai2024ezaudio, | |
| title={EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer}, | |
| author={Hai, Jiarui and Xu, Yong and Zhang, Hao and Li, Chenxing and Wang, Helin and Elhilali, Mounya and Yu, Dong}, | |
| journal={arXiv preprint arXiv:2409.10819}, | |
| year={2024} | |
| } | |
| ``` | |
| ## Acknowledgement | |
| Some code is borrowed from or inspired by: [U-ViT](https://github.com/baofff/U-ViT), [PixArt-alpha](https://github.com/PixArt-alpha/PixArt-alpha), [HunyuanDiT](https://github.com/Tencent/HunyuanDiT), and [Stable Audio](https://github.com/Stability-AI/stable-audio-tools).