| license: apache-2.0 | |
| tags: | |
| - music | |
| - text2music | |
| - acestep | |
| pipeline_tag: text-to-audio | |
| language: | |
| - en | |
| - zh | |
| - de | |
| - fr | |
| - es | |
| - it | |
| - pt | |
| - pl | |
| - tr | |
| - ru | |
| - cs | |
| - nl | |
| - ar | |
| - ja | |
| - hu | |
| - ko | |
| - hi | |
| # ACE-Step: A Step Towards Music Generation Foundation Model | |
|  | |
| ## Model Description | |
| ACE-Step is a novel open-source foundation model for music generation that overcomes key limitations of existing approaches through a holistic architectural design. It integrates diffusion-based generation with Sana's Deep Compression AutoEncoder (DCAE) and a lightweight linear transformer, achieving state-of-the-art performance in generation speed, musical coherence, and controllability. | |
| **Key Features:** | |
| - 15× faster than LLM-based baselines (about 20 s to generate 4 minutes of music on an A100) | |
| - Superior musical coherence across melody, harmony, and rhythm | |
| - Full-song generation with duration control, driven by natural-language descriptions | |
| ## Uses | |
| ### Direct Use | |
| ACE-Step can be used for: | |
| - Generating original music from text descriptions | |
| - Music remixing and style transfer | |
| - Editing song lyrics | |
| ### Downstream Use | |
| The model serves as a foundation for: | |
| - Voice cloning applications | |
| - Specialized music generation (rap, jazz, etc.) | |
| - Music production tools | |
| - Creative AI assistants | |
| ### Out-of-Scope Use | |
| The model should not be used for: | |
| - Generating copyrighted content without permission | |
| - Creating harmful or offensive content | |
| - Misrepresenting AI-generated music as human-created | |
| ## How to Get Started | |
| See the GitHub repository for installation and usage instructions: https://github.com/ace-step/ACE-Step | |
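The GitHub repository is the authoritative source for setup. As a minimal sketch only, the weights published under this model card's Hub ID (`ACE-Step/ACE-Step-v1-3.5B`) could be fetched ahead of time with `huggingface_hub.snapshot_download`. The helper name `download_checkpoint` and the `./checkpoints` directory are assumptions made for illustration; the directory layout the ACE-Step code actually expects is not stated here, so check the repository before relying on it.

```python
# Hub repository ID of this model card.
REPO_ID = "ACE-Step/ACE-Step-v1-3.5B"


def download_checkpoint(local_dir: str = "./checkpoints") -> str:
    """Download every file in the model repository and return the local path.

    `local_dir` is a hypothetical target directory, not one prescribed by
    the ACE-Step project.
    """
    # Imported inside the function so that defining the helper does not
    # require huggingface_hub to be installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling `download_checkpoint()` needs `pip install huggingface_hub` and network access, and it transfers the full set of model files.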
| ## Hardware Performance | |
| | Device | RTF (27 steps) | RTF (60 steps) | |
| |---------------|----------|----------| | |
| | NVIDIA A100 | 27.27x | 12.27x | | |
| | RTX 4090 | 34.48x | 15.63x | | |
| | RTX 3090 | 12.76x | 6.48x | | |
| | M2 Max | 2.27x | 1.03x | | |
| *Values are real-time factors (RTF): seconds of audio generated per second of compute. Higher values indicate faster generation.* | |
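To make the table concrete, the sketch below converts an RTF into wall-clock generation time, assuming RTF is audio duration divided by generation time. The function name `generation_seconds` is illustrative. Under that reading, the A100 figure at 60 steps works out to roughly 20 s for a 4-minute track, which agrees with the figure quoted under Key Features.

```python
def generation_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time implied by an RTF (audio seconds per compute second)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# RTF values copied from the table above, keyed by number of steps.
RTF_TABLE = {
    "NVIDIA A100": {27: 27.27, 60: 12.27},
    "RTX 4090": {27: 34.48, 60: 15.63},
    "RTX 3090": {27: 12.76, 60: 6.48},
    "M2 Max": {27: 2.27, 60: 1.03},
}

FOUR_MINUTES = 240.0  # seconds of audio

for device, by_steps in RTF_TABLE.items():
    for steps, rtf in by_steps.items():
        seconds = generation_seconds(FOUR_MINUTES, rtf)
        print(f"{device:12s} {steps:2d} steps: {seconds:6.1f} s for 4 min of audio")
```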
| ## Limitations | |
| - Performance varies by language (top 10 languages perform best) | |
| - Longer generations (>5 minutes) may lose structural coherence | |
| - Rare instruments may not render perfectly | |
| - Output Inconsistency: Highly sensitive to random seeds and input duration, leading to varied "gacha-style" results. | |
| - Style-specific Weaknesses: Underperforms on certain genres (e.g., Chinese rap, `zh_rap`), with limited style adherence and a ceiling on musicality | |
| - Continuity Artifacts: Unnatural transitions in repainting/extend operations | |
| - Vocal Quality: Coarse vocal synthesis lacking nuance | |
| - Control Granularity: Needs finer-grained musical parameter control | |
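Because results vary strongly with the random seed, one common workaround (not something this model card prescribes) is to generate several candidates with different seeds and keep the preferred one. The sketch below shows that pattern in isolation. `sweep_seeds`, `fake_generate`, and the scoring function are hypothetical stand-ins; in practice `generate` would wrap whatever inference entry point the ACE-Step repository provides, and `score` could be a listener's rating.

```python
import random
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def sweep_seeds(
    generate: Callable[[int], T],
    score: Callable[[T], float],
    seeds: List[int],
) -> Tuple[int, T]:
    """Run generate(seed) for each seed; return the best-scoring (seed, output)."""
    if not seeds:
        raise ValueError("seeds must not be empty")
    results = [(seed, generate(seed)) for seed in seeds]
    return max(results, key=lambda pair: score(pair[1]))


def fake_generate(seed: int) -> float:
    """Placeholder for a real generation call: a seeded pseudo-random number."""
    return random.Random(seed).random()


# The "output" here is a float, so the identity function serves as the score.
best_seed, best_output = sweep_seeds(fake_generate, lambda x: x, [0, 1, 2, 3])
print(best_seed, best_output)
```

Recording the winning seed alongside the prompt and duration makes a good result reproducible later.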
| ## Ethical Considerations | |
| Users should: | |
| - Verify originality of generated works | |
| - Disclose AI involvement | |
| - Respect cultural elements and copyrights | |
| - Avoid harmful content generation | |
| ## Model Details | |
| **Developed by:** ACE Studio and StepFun | |
| **Model type:** Diffusion-based music generation with transformer conditioning | |
| **License:** Apache 2.0 | |
| **Resources:** | |
| - [Project Page](https://ace-step.github.io/) | |
| - [Demo Space](https://huggingface.co/spaces/ACE-Step/ACE-Step) | |
| - [GitHub Repository](https://github.com/ACE-Step/ACE-Step) | |
| ## Citation | |
| ```bibtex | |
| @misc{gong2025acestep, | |
| title={ACE-Step: A Step Towards Music Generation Foundation Model}, | |
| author={Junmin Gong and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo}, | |
| howpublished={\url{https://github.com/ace-step/ACE-Step}}, | |
| year={2025}, | |
| note={GitHub repository} | |
| } | |
| ``` | |
| ## Acknowledgements | |
| This project is co-led by ACE Studio and StepFun. |