ACE-Step v1.5 XL SFT Diffusers

Diffusers-format checkpoint of ACE-Step v1.5 XL SFT - the supervised fine-tuned 5B-parameter flow-matching DiT for text-to-music generation (hidden_size=2560, 32 layers, 32 heads; encoder_hidden_size=2048 on the condition encoder).

This repository is the official Diffusers-format release of the ACE-Step v1.5 XL SFT checkpoint. It loads directly with AceStepPipeline from huggingface/diffusers.

Weights are produced by scripts/convert_ace_step_to_diffusers.py from the upstream release and packaged in the standard Diffusers pipeline layout (model_index.json + one subdirectory per module), so the full pipeline can be loaded in a single from_pretrained call.

Usage

Until a packaged Diffusers release includes AceStepPipeline, install Diffusers from source:

pip install git+https://github.com/huggingface/diffusers.git

import torch
import soundfile as sf
from diffusers import AceStepPipeline

pipe = AceStepPipeline.from_pretrained(
    "ACE-Step/acestep-v15-xl-sft-diffusers",
    torch_dtype=torch.bfloat16,
)
pipe = pipe.to("cuda")

# Long-form audio: enable VAE tiling to keep decode memory bounded.
pipe.vae.enable_tiling()

output = pipe(
    prompt="An upbeat synthwave track with driving drums and a catchy lead",
    lyrics="[Verse]\nNeon lights are calling me\n[Chorus]\nRide the wave tonight",
    audio_duration=30.0,
    num_inference_steps=8,
    guidance_scale=7.0,
    shift=3.0,
    generator=torch.Generator(device="cuda").manual_seed(42),
)

audio = output.audios[0]  # (channels, samples), 48 kHz
sf.write("acestep-xl-sft.wav", audio.T.cpu().float().numpy(), pipe.sample_rate)
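The transpose in the sf.write call is needed because soundfile expects (samples, channels), while the pipeline output is laid out as (channels, samples). The sketch below uses a hypothetical helper, describe_audio, to sanity-check a clip's layout and duration from its shape alone. It does not assume that the pipeline returns exactly audio_duration * sample_rate samples.

```python
def describe_audio(shape, sample_rate):
    """Summarize a (channels, samples) audio array from its shape.

    Hypothetical helper for illustration; not part of Diffusers.
    """
    if len(shape) != 2:
        raise ValueError(f"expected a 2-D (channels, samples) shape, got {shape}")
    channels, samples = shape
    if channels > samples:
        # More channels than samples almost always means the array is
        # already transposed to (samples, channels).
        raise ValueError(f"shape {shape} looks like (samples, channels)")
    return {
        "channels": channels,
        "samples": samples,
        "duration_s": samples / sample_rate,
        # Shape that soundfile.write expects after the transpose.
        "soundfile_shape": (samples, channels),
    }


# 30 s of 48 kHz stereo is 30 * 48_000 = 1_440_000 samples per channel.
info = describe_audio((2, 30 * 48_000), sample_rate=48_000)
print(info["duration_s"], info["soundfile_shape"])
```

With a real output, something like describe_audio(tuple(audio.shape), pipe.sample_rate) would report the actual clip length before writing it to disk.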

Unlike the turbo checkpoint, XL SFT is not guidance-distilled. The pipeline uses ACE-Step's APG guidance path when guidance_scale > 1.0; guidance_scale=7.0 and shift=3.0 are the recommended defaults. You can increase num_inference_steps for slower, higher-quality sampling.
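To give a feel for what shift does, the sketch below applies the static time-shift formula used by FlowMatchEulerDiscreteScheduler, sigma' = shift * sigma / (1 + (shift - 1) * sigma), to an evenly spaced schedule. It assumes AceStepPipeline forwards shift to the scheduler in this standard form. The evenly spaced base grid is a simplification, and shifted_sigmas is a hypothetical helper name.

```python
def shifted_sigmas(num_steps, shift):
    """Return a shifted noise-level schedule for flow-matching sampling.

    Illustration only: assumes an evenly spaced base grid from 1.0 down to
    1/num_steps and the static shift formula; the real pipeline may differ.
    """
    base = [1.0 - i / num_steps for i in range(num_steps)]
    return [shift * s / (1.0 + (shift - 1.0) * s) for s in base]


# Compare the unshifted schedule with the recommended shift=3.0 at 8 steps.
for shift in (1.0, 3.0):
    print(shift, [round(s, 3) for s in shifted_sigmas(8, shift)])
```

Under these assumptions, a shift above 1 moves every intermediate noise level toward 1.0, so more of the step budget is spent at high noise, where the global structure of the track is decided.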

For batched prompts with padding and FlashAttention, use the variable-length backend:

pipe.transformer.set_attention_backend("flash_varlen")
pipe.condition_encoder.set_attention_backend("flash_varlen")

For single-prompt generation, the regular flash backend is also suitable.
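The choice between the two backends can be made from the prompt batch itself. The sketch below uses a hypothetical helper, pick_attention_backend, and assumes any batch of more than one prompt is padded. The set_attention_backend calls are left as comments because they need a loaded pipeline.

```python
def pick_attention_backend(prompts):
    """Choose a FlashAttention backend name from the prompt batch.

    Hypothetical helper: "flash_varlen" for batches (assumed to be padded),
    "flash" for a single prompt.
    """
    if isinstance(prompts, str):
        prompts = [prompts]
    return "flash_varlen" if len(prompts) > 1 else "flash"


backend = pick_attention_backend(["lo-fi hip hop beat", "orchestral film score"])
print(backend)
# With a loaded pipeline (not executed here):
# pipe.transformer.set_attention_backend(backend)
# pipe.condition_encoder.set_attention_backend(backend)
```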

Repository layout

├── model_index.json
├── transformer/        # AceStepTransformer1DModel (DiT, 5B params, bf16)
├── condition_encoder/  # AceStepConditionEncoder (with baked-in silence_latent)
├── audio_tokenizer/    # AceStepAudioTokenizer
├── audio_token_detokenizer/ # AceStepAudioTokenDetokenizer
├── vae/                # AutoencoderOobleck (48 kHz stereo)
├── text_encoder/       # Qwen3-Embedding-0.6B
├── tokenizer/          # Qwen3 tokenizer
├── scheduler/          # FlowMatchEulerDiscreteScheduler config
└── silence_latent.pt   # Raw reference (kept for debugging; not needed at runtime)
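In the standard Diffusers layout, model_index.json maps each component subdirectory to a [library, class name] pair, alongside metadata keys that start with an underscore. The sketch below lists the components from such a file. The sample is abbreviated and illustrative, not a copy of this repository's file, and the library strings in it are assumptions.

```python
import json


def list_components(model_index_text):
    """Parse model_index.json text and return {component: (library, class_name)}.

    Keys starting with "_" (e.g. "_class_name") are pipeline metadata,
    not loadable components, so they are skipped.
    """
    index = json.loads(model_index_text)
    components = {}
    for name, value in index.items():
        if name.startswith("_"):
            continue
        library, class_name = value
        components[name] = (library, class_name)
    return components


# Illustrative, abbreviated sample; not the actual file from this repository.
sample = json.dumps({
    "_class_name": "AceStepPipeline",
    "transformer": ["diffusers", "AceStepTransformer1DModel"],
    "vae": ["diffusers", "AutoencoderOobleck"],
    "scheduler": ["diffusers", "FlowMatchEulerDiscreteScheduler"],
})
for name, (library, class_name) in list_components(sample).items():
    print(f"{name}/ -> {library}.{class_name}")
```

Running the same function on the real model_index.json is a quick way to check that every subdirectory in the tree above is registered with the pipeline.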

License

ACE-Step weights: MIT (same as upstream)
text_encoder/ (Qwen3-Embedding-0.6B): Apache 2.0 - redistributed per Qwen's license

Citation

@misc{gong2026acestep,
  title = {ACE-Step 1.5: Pushing the Boundaries of Open-Source Music Generation},
  author = {Junmin Gong and Yulin Song and Wenxiao Zhao and Sen Wang and Shengyuan Xu and Jing Guo},
  howpublished = {\url{https://github.com/ace-step/ACE-Step-1.5}},
  year = {2026},
  note = {GitHub repository}
}
