A modular node suite for ComfyUI that implements AceStep 1.5 SFT (Supervised Fine-Tuning), a state-of-the-art music generation model. It starts from the official AceStep workflow and extends it with stronger conditioning control and practical ComfyUI-oriented quality options.
SFT = Supervised Fine-Tuning: A specialized version of AceStep optimized for generating superior quality audio through supervised training.
This package provides eight nodes under audio/AceStep SFT:
| Node | Purpose |
|---|---|
| AceStep 1.5 SFT Model Loader | Loads the diffusion model, CLIP text encoders, and VAE |
| AceStep 1.5 SFT Lora Loader | Applies a LoRA to MODEL + CLIP (chainable) |
| AceStep 1.5 SFT TextEncode | Encodes caption, lyrics, and metadata into conditioning |
| AceStep 1.5 SFT Generate | Diffusion sampler + optional VAE decode |
| AceStep 1.5 SFT Preview Audio | Audio playback with waveform spectrum visualizer |
| AceStep 1.5 SFT Save Audio | Save audio (FLAC/MP3/Opus) with waveform visualizer |
| AceStep 1.5 SFT Get Music Infos | AI-powered audio analysis (tags, BPM, key/scale) |
| AceStep 1.5 SFT Turbo Tag Adapter | Rewrites Turbo-oriented tags into SFT-friendly tags (BETA) |
The workflow is split into dedicated nodes for maximum flexibility:
Model Loader → (model, clip, vae)
  model, clip → Lora Loader (optional, chainable) → (model, clip)
  clip → TextEncode → (positive, negative)
  model, positive, negative, vae → Generate → (latent, audio)
  audio → Preview Audio / Save Audio
The node supports three classifier-free guidance modes, each with unique characteristics:
- APG (Adaptive Projected Guidance) (Recommended)
  - Dynamic adaptation via momentum buffering
  - Gradient clipping with adaptive thresholds
  - Orthogonal projection to eliminate unwanted noise
  - AceStep SFT default: best balance of quality and stability
- ADG (Angle-based Dynamic Guidance)
  - Angle-based guidance between conditions
  - Operates in velocity space (flow matching)
  - Ideal for aggressive style distortion
- Standard CFG
  - Traditional Classifier-Free Guidance
  - Simple and predictable implementation
  - Useful as a comparison baseline
- Auto-Duration: Automatically estimates music duration by analyzing lyric structure
- LLM Encoding: Use Qwen LLM (0.6B or 1.7B/4B) to generate semantic audio codes
- Auto Values: BPM, Time Signature, and Key/Scale automatic (model decides)
- Multilingual Support: Over 23 languages supported
- Audio Tag Extraction: Uses the native ACE-Step Transcriber to extract lyric, vocal, and song-structure tags from audio
- BPM Detection: Automatic tempo detection via librosa
- Key/Scale Detection: Detects musical key and scale (e.g. "G minor")
- JSON Output: Structured `music_infos` output with all analysis results
Both Preview Audio and Save Audio nodes feature:
- Interactive waveform spectrum display directly on the node (dark background with amplitude bars)
- Play/Pause button with click-to-seek on the waveform
- Time display showing current position and total duration
Save Audio additionally supports:
- Multiple formats: FLAC (lossless), MP3, and Opus
- Quality options: V0, 64k, 96k, 128k, 192k, 320k
- Auto-incrementing filenames with configurable prefix
- Latent-based Refinement: Use `denoise < 1.0` with `latent_or_audio` connected to refine existing audio
- Accepts AUDIO or LATENT: Connect any audio or latent output for img2img-style editing
- Batch Generation: Generate multiple variations in parallel
- Split Text/Lyric Guidance: Independent `guidance_scale_text` and `guidance_scale_lyric`
- Omega Scale: Mean-preserving output reweighting to approximate AceStep scheduler behavior
- ERG Approximation: Node-local prompt energy reweighting via `erg_scale`
- Guidance Interval Decay: Smoothly decay guidance inside the active interval
- Direct LoRA Application: The Lora Loader takes MODEL + CLIP, applies the LoRA via `comfy.sd.load_lora_for_models()`, and outputs the modified MODEL + CLIP
- Chainable: Stack multiple Lora Loaders in sequence
- Separate strengths: Independent `strength_model` and `strength_clip`
- DoRA support: Full DoRA (Weight-Decomposed Low-Rank Adaptation) support with automatic `dora_scale` dimension fix
- Local `Loras/` folder: Drop LoRA files directly into the node's `Loras/` folder; they are automatically registered at startup
- Auto PEFT/DoRA conversion: PEFT-format LoRAs (`adapter_config.json` + `adapter_model.safetensors`) placed in `Loras/` are automatically converted to ComfyUI format on first startup
- Latent Shift: Additive anti-clipping correction
- Latent Rescale: Multiplicative scaling for dynamic control
- ComfyUI installed and functional
- CUDA/GPU or equivalent (modern processors)
- Recommended for better output quality (based on practical testing): use the merged SFT+Turbo model.
- Required model files:
  - Diffusion model (DiT): `acestep_v1.5_sft.safetensors`
  - Text Encoders: `qwen_0.6b_ace15.safetensors`, `qwen_1.7b_ace15.safetensors` (or 4B)
  - VAE: `ace_1.5_vae.safetensors`
Download the required models from HuggingFace:
- Diffusion Model (Recommended: merged SFT+Turbo):
- Alternative Diffusion Model (official SFT):
- Text Encoders (choose any versions):
  - Text Encoders Collection
    - `qwen_0.6b_ace15.safetensors` (caption processing)
    - `qwen_1.7b_ace15.safetensors` or `qwen_4b_ace15.safetensors` (audio code generation)
- VAE (Audio codec):
1. Clone the repository to your custom nodes folder:
       cd ComfyUI/custom_nodes
       git clone https://github.com/jeankassio/ComfyUI-AceStep_SFT.git
2. Place model files in the appropriate directories:
       ComfyUI/models/diffusion_models/   # AceStep 1.5 SFT model
       ComfyUI/models/text_encoders/      # Qwen encoders
       ComfyUI/models/vae/                # VAE
       ComfyUI/models/loras/              # Optional AceStep 1.5 LoRAs
3. (Optional) Place LoRAs in the local folder:
       ComfyUI/custom_nodes/ComfyUI-AceStep_SFT/Loras/   # Local LoRA folder
   You can place LoRAs here in any of these formats:
   - ComfyUI format: Single `.safetensors` file (ready to use)
   - PEFT/DoRA format: A folder containing `adapter_config.json` + `adapter_model.safetensors` (auto-converted on startup)
   - Nested zip artifact: If your zip extracted a folder-inside-folder, the node detects this and fixes it automatically
4. Restart ComfyUI. The nodes will appear under `audio/AceStep SFT`.
Loads the AceStep 1.5 diffusion model, dual CLIP text encoders, and audio VAE.
Inputs:
- `diffusion_model`: AceStep 1.5 diffusion model (`.safetensors`)
- `text_encoder_1`: Qwen3-0.6B encoder (caption processing)
- `text_encoder_2`: Qwen3 LLM (1.7B or 4B, audio code generation)
- `vae_name`: AceStep 1.5 audio VAE
Outputs:
- `model`: MODEL, connect to Lora Loader or Generate
- `clip`: CLIP, connect to Lora Loader or TextEncode
- `vae`: VAE, connect to Generate
Applies a LoRA directly to the MODEL and CLIP. Multiple Lora Loaders can be chained.
Inputs:
- `model`: MODEL from Model Loader or previous Lora Loader
- `clip`: CLIP from Model Loader or previous Lora Loader
- `lora_name`: LoRA file from `ComfyUI/models/loras` or the local `Loras/` folder
- `strength_model`: strength applied to the diffusion model
- `strength_clip`: strength applied to the text encoder stack
Outputs:
- `model`: MODEL, connect to next Lora Loader or Generate
- `clip`: CLIP, connect to next Lora Loader or TextEncode
| Format | What to place in `Loras/` | Action |
|---|---|---|
| ComfyUI `.safetensors` | Single file | Used directly |
| PEFT/DoRA directory | Folder with `adapter_config.json` + `adapter_model.safetensors` | Auto-converted to `*_comfyui.safetensors` on startup |
| Nested zip artifact | Folder containing a `.safetensors` inside | Auto-extracted to root on startup |
Encodes caption, lyrics, and metadata into positive and negative conditioning for the Generate node.
Inputs:
- `clip`: CLIP from Model Loader or Lora Loader
- `caption`: Text description of the music (genre, mood, instruments)
- `lyrics`: Song lyrics or `[Instrumental]`
- `instrumental`: Force instrumental mode
- `seed`, `duration`, `bpm`, `timesignature`, `language`, `keyscale`
- Optional: `generate_audio_codes`, `lm_cfg_scale`, `lm_temperature`, `lm_top_p`, `lm_top_k`, `lm_min_p`, `lm_negative_prompt`
- Optional style overrides: `style_tags`, `style_bpm`, `style_keyscale` (from Music Analyzer)
Outputs:
- `positive`: CONDITIONING, connect to Generate
- `negative`: CONDITIONING, connect to Generate
Diffusion sampler + optional VAE decoder. Requires MODEL and conditioning inputs.
Inputs:
- `model`: MODEL from Model Loader or Lora Loader
- `positive`: CONDITIONING from TextEncode
- `negative`: CONDITIONING from TextEncode
- Sampling: `seed`, `steps`, `cfg`, `sampler_name`, `scheduler`, `denoise`, `duration`, `infer_method`, `guidance_mode`
- Optional: `vae` (for audio output), `latent_or_audio` (for img2img), `batch_size`
- Optional post-processing: `latent_shift`, `latent_rescale`, `fade_in_duration`, `fade_out_duration`, `voice_boost`, `use_tiled_vae`
- Optional guidance: `apg_eta`, `apg_momentum`, `apg_norm_threshold`, `guidance_interval`, `guidance_interval_decay`, `min_guidance_scale`, `guidance_scale_text`, `guidance_scale_lyric`, `omega_scale`, `erg_scale`, `cfg_interval_start`, `cfg_interval_end`, `shift`
Outputs:
- `model`: MODEL (passthrough for chaining)
- `vae`: VAE (passthrough for chaining)
- `positive`: CONDITIONING (passthrough)
- `negative`: CONDITIONING (passthrough)
- `latent`: LATENT (raw diffusion output)
- `audio`: AUDIO (decoded audio, only when VAE is connected)
Previews audio with an interactive waveform spectrum visualizer directly on the node.
Inputs:
audio: AUDIO to preview
Features:
- Interactive waveform display with play/pause button
- Click-to-seek on the waveform
- Current time / total duration display
Saves audio to disk with an interactive waveform spectrum visualizer.
Inputs:
- `audio`: AUDIO to save
- `filename_prefix`: Filename prefix (supports subfolder paths, e.g. `audio/AceStep`)
- `format`: FLAC, MP3, or Opus
- `quality` (optional): V0, 64k, 96k, 128k, 192k, 320k (for MP3/Opus)
Features:
- Auto-incrementing filenames (e.g. `AceStep_00001_.flac`, `AceStep_00002_.flac`)
- Waveform visualizer with play/pause and seek
- Metadata embedding (prompt, workflow)
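The auto-incrementing filename scheme can be illustrated with a minimal sketch. The function name `next_output_path` and the directory scan are assumptions made for illustration; only the `<prefix>_<NNNNN>_.<ext>` pattern comes from the examples above, and the node's actual counter logic may differ.

```python
import re
from pathlib import Path


def next_output_path(output_dir: str, prefix: str, ext: str) -> Path:
    """Return the next free '<prefix>_<NNNNN>_.<ext>' path in output_dir (hypothetical helper)."""
    folder = Path(output_dir)
    # Match names such as AceStep_00001_.flac and capture the 5-digit counter.
    pattern = re.compile(rf"^{re.escape(prefix)}_(\d{{5}})_\.{re.escape(ext)}$")
    highest = 0
    if folder.is_dir():
        for entry in folder.iterdir():
            match = pattern.match(entry.name)
            if match:
                highest = max(highest, int(match.group(1)))
    return folder / f"{prefix}_{highest + 1:05d}_.{ext}"
```

In an empty folder this yields `AceStep_00001_.flac` for the prefix `AceStep` and extension `flac`; each existing match pushes the counter one higher.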
AI-powered audio analysis node that extracts descriptive tags, BPM, and key/scale from audio input.
Inputs:
- `audio`: Audio input to analyze
- `get_tags` / `get_bpm` / `get_keyscale`: Enable/disable each analysis
- `max_new_tokens`: Maximum tokens for transcription output
- `audio_duration`: Max seconds of audio to analyze
- `temperature`, `top_p`, `top_k`, `repetition_penalty`, `seed`: Generation parameters
- `unload_model`: Free VRAM after analysis
- `use_flash_attn`: Enable Flash Attention 2 (if compatible)
Outputs:
- `tags`: Comma-separated descriptive tags (STRING)
- `bpm`: Detected BPM (INT)
- `keyscale`: Key and scale, e.g. "G minor" (STRING)
- `music_infos`: JSON with all results (STRING)
Rewrites Turbo-oriented music tags into shorter SFT-friendly prompt tags.
Inputs:
- `turbo_tags`: Turbo-style tags or caption
- `adaptation_strength`: conservative / balanced / aggressive
- `keep_unknown_tags`: Keep tags that were not explicitly mapped
- `add_sft_bias_tags`: Add extra SFT-oriented anchor tags
Outputs:
- `sft_tags`: Adapted comma-separated tags (STRING)
- `notes`: Conversion notes (STRING)
- `suggested_cfg`: Suggested CFG value (FLOAT)
- `suggested_steps`: Suggested steps value (INT)
| Parameter | Range | Description |
|---|---|---|
| model | MODEL | AceStep 1.5 diffusion model from Model Loader or Lora Loader |
| positive | CONDITIONING | Positive conditioning from TextEncode |
| negative | CONDITIONING | Negative conditioning from TextEncode |
| seed | 0 - 2^64 | Seed for reproducibility |
| steps | 1 - 200 | Diffusion inference steps (default: 50) |
| cfg | 1.0 - 20.0 | Classifier-free guidance scale (default: 7.0) |
| sampler_name | - | Sampler (euler, dpmpp, etc.) |
| scheduler | - | Scheduler (normal, karras, etc.) |
| denoise | 0.0 - 1.0 | Denoising strength (1.0 = fresh, < 1.0 = editing) |
| duration | 0.0 - 600.0 | Duration in seconds (0 = auto) |
| infer_method | ode/sde | ODE = deterministic, SDE = stochastic |
| guidance_mode | apg/adg/standard_cfg | Guidance type (default: apg) |
- batch_size (1-16): Number of audios to generate in parallel
- vae: VAE from Model Loader (required for audio output)
- latent_or_audio: Base input for refinement (img2img). Accepts AUDIO or LATENT
- latent_shift (-0.2-0.2, default: 0.0): Additive shift (anti-clipping)
- latent_rescale (0.5-1.5, default: 1.0): Multiplicative scaling
- fade_in_duration / fade_out_duration (0.0-10.0, default: 0.0): Optional linear fades
- use_tiled_vae (default: True): Uses tiled VAE for long audio / low VRAM
- voice_boost (-12.0-12.0, default: 0.0): Output gain in dB
- apg_eta (-10.0-10.0, default: 0.0): Parallel component retention
- apg_momentum (-1.0-1.0, default: -0.75): Momentum buffer coefficient
- apg_norm_threshold (0.0-15.0, default: 2.5): Norm threshold for gradient clipping
- guidance_interval (-1.0-1.0, default: 0.5): Centered guidance interval width
- guidance_interval_decay (0.0-1.0, default: 0.0): Linear decay inside interval
- min_guidance_scale (0.0-30.0, default: 3.0): Lower bound with decay
- guidance_scale_text (-1.0-30.0, default: -1.0): Text-only guidance (split)
- guidance_scale_lyric (-1.0-30.0, default: -1.0): Lyric-only guidance (split)
- omega_scale (-8.0-8.0, default: 0.0): Mean-preserving reweighting
- erg_scale (-0.9-2.0, default: 0.0): Prompt energy reweighting
- cfg_interval_start / cfg_interval_end (0.0-1.0): Schedule fraction range
- shift (0.0-5.0, default: 3.0): Timestep schedule shift
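To make the interaction between `cfg`, `cfg_interval_start`/`cfg_interval_end`, `guidance_interval_decay`, and `min_guidance_scale` more concrete, here is a minimal sketch. It assumes guidance is switched off (scale 1.0) outside the interval and decays linearly toward `min_guidance_scale` inside it; the helper name `guidance_scale_at` and these exact semantics are illustrative assumptions, not the node's confirmed implementation.

```python
def guidance_scale_at(progress, cfg, start=0.0, end=1.0, decay=0.0, min_scale=3.0):
    """Effective guidance scale at a schedule fraction in [0, 1] (hypothetical helper).

    Assumption: outside [start, end] guidance is disabled (scale 1.0); inside,
    the scale moves linearly from cfg toward min_scale, weighted by decay.
    """
    if end <= start or not (start <= progress <= end):
        return 1.0
    position = (progress - start) / (end - start)  # 0.0 at interval start, 1.0 at its end
    floor = min(cfg, min_scale)                    # never decay upward
    return cfg - decay * position * (cfg - floor)
```

With `cfg=7.0`, `min_scale=3.0`, and `decay=1.0`, the scale falls from 7.0 at the start of the interval to 3.0 at its end; with `decay=0.0` it stays at 7.0 throughout.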
| Parameter | Range | Description |
|---|---|---|
| clip | CLIP | CLIP from Model Loader or Lora Loader |
| caption | text | Music description (genre, mood, instruments) |
| lyrics | text | Song lyrics or [Instrumental] |
| instrumental | boolean | Force instrumental mode |
| seed | 0 - 2^64 | Seed |
| duration | 0.0 - 600.0 | Duration in seconds (0 = auto from lyrics) |
| bpm | 0 - 300 | Beats per minute (0 = auto) |
| timesignature | auto/2/3/4/6 | Time signature numerator |
| language | - | Lyric language (en, ja, zh, es, pt, etc.) |
| keyscale | auto/... | Key and scale (e.g. "C major") |
- generate_audio_codes (default: True): Enable LLM audio code generation
- lm_cfg_scale (0.0-100.0, default: 2.0): LLM CFG scale
- lm_temperature (0.0-2.0, default: 0.85): LLM sampling temperature
- lm_top_p (0.0-2000.0, default: 0.9): Nucleus sampling
- lm_top_k (0-100, default: 0): Top-k sampling
- lm_min_p (0.0-1.0, default: 0.0): Minimum probability
- lm_negative_prompt: Negative prompt for LLM CFG
- style_tags: Appended to caption when connected
- style_bpm: Overrides bpm when > 0
- style_keyscale: Overrides keyscale when not empty
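The three style override rules above can be sketched in a few lines. The function name `apply_style_overrides` and the `", "` separator used when appending tags are assumptions for illustration; the node may join tags differently.

```python
def apply_style_overrides(caption, bpm, keyscale,
                          style_tags="", style_bpm=0, style_keyscale=""):
    """Merge optional analyzer outputs into the TextEncode inputs (hypothetical helper)."""
    tags = style_tags.strip()
    if tags:
        # style_tags are appended to the caption when connected.
        caption = f"{caption.rstrip(', ')}, {tags}" if caption.strip() else tags
    if style_bpm > 0:
        # style_bpm replaces bpm only when it is positive.
        bpm = style_bpm
    if style_keyscale.strip():
        # style_keyscale replaces keyscale only when it is not empty.
        keyscale = style_keyscale.strip()
    return caption, bpm, keyscale
```

Leaving all three style inputs at their defaults returns the original `caption`, `bpm`, and `keyscale` unchanged.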
Model Loader:
diffusion_model: "acestep_v1.5_sft.safetensors"
text_encoder_1: "qwen_0.6b_ace15.safetensors"
text_encoder_2: "qwen_1.7b_ace15.safetensors"
vae_name: "ace_1.5_vae.safetensors"
→ model, clip, vae
TextEncode:
clip: (from Model Loader)
caption: "upbeat electronic dance music with synthesizers"
lyrics: [Instrumental]
instrumental: True
duration: 60.0
→ positive, negative
Generate:
model: (from Model Loader)
positive: (from TextEncode)
negative: (from TextEncode)
vae: (from Model Loader)
cfg: 7.0, steps: 50, guidance_mode: "apg"
→ audio
Preview Audio:
audio: (from Generate)
Model Loader → model, clip, vae
Lora Loader (model, clip from Model Loader):
lora_name: "ace-step15-style1.safetensors"
strength_model: 0.7
strength_clip: 0.0
→ model, clip
Lora Loader (model, clip from previous Lora Loader):
lora_name: "Ace-Step1.5-TechnoRain.safetensors"
strength_model: 0.35
strength_clip: 0.0
→ model, clip
TextEncode (clip from last Lora Loader) → positive, negative
Generate (model from last Lora Loader, vae from Model Loader) → audio
Save Audio (format: mp3, quality: 320k)
Generate:
latent_or_audio: (existing audio)
denoise: 0.7 (preserves 30% of source)
duration: 0 (uses input duration)
→ Refines audio while preserving original characteristics
Music Analyzer:
audio: (input audio file)
→ tags, bpm, keyscale
TextEncode:
style_tags: (from Music Analyzer)
style_bpm: (from Music Analyzer)
style_keyscale: (from Music Analyzer)
→ positive, negative
Generate → Save Audio (format: flac)
Solution: Use negative latent_shift (e.g., -0.1) to reduce amplitude before VAE decoding
Solution: Increase `apg_norm_threshold` (e.g., 3.0-4.0). A higher threshold relaxes the norm clipping applied to the guidance difference, while a lower one clips more strongly.
Solution:
- Use `guidance_mode: "apg"` (recommended)
- Start from `steps: 50`, `cfg: 7.0`, `sampler_name: "euler"`, `scheduler: "normal"`, `infer_method: "ode"`
Solution:
- Lower `strength_model` first, e.g. `0.2` to `0.6`
- Set `strength_clip` to `0.0` unless the LoRA explicitly targets the text encoders
- Compare `guidance_mode: "standard_cfg"` vs `"apg"` for that LoRA
- Avoid stacking multiple strong LoRAs at full strength
Cause: DoRA LoRAs store dora_scale as a 1D tensor [N]. ComfyUI's weight_decompose expects [N,1].
Solution: This is fixed automatically by the Lora Loader: all `dora_scale` tensors are unsqueezed to 2D `[N,1]` at load time.
Solution:
- Place the PEFT folder (containing `adapter_config.json` + `adapter_model.safetensors`) inside `ComfyUI-AceStep_SFT/Loras/`
- Restart ComfyUI; the conversion runs automatically on startup
- Check the console for a `[AceStep SFT] Converted PEFT/DoRA → ComfyUI: ...` message
- The converted file appears as `*_comfyui.safetensors` in the dropdown
Solution: Reduce batch_size, lower steps to ~20, or use "karras" scheduler
| Aspect | APG | ADG | Standard CFG |
|---|---|---|---|
| Quality | ★★★★★ | ★★★★ | ★★★ |
| Stability | ★★★★★ | ★★★★ | ★★ |
| Dynamics | Natural | Aggressive | Predictable |
| Computation | Normal | Normal | Minimal |
| Recommended | ✅ Yes | For extreme styles | Baseline |
- Use `guidance_mode=apg` with `steps=50` to `64` for best quality
- For img2img refinement, start with `denoise=0.5` to `0.7` to preserve the original character
- Mild vocal hiss is usually a generation artifact; APG and slightly higher step counts generally help more than raw `cfg`
- Simplify overly dense or contradictory tags for cleaner results
- AceStep 1.5: ICML 2024 (Learning Universal Features for Efficient Audio Generation)
- Flow Matching: Lipman et al., "Flow Matching for Generative Modeling" (ICLR 2023)
- APG/ADG: Techniques aligned with official AceStep paper
- ComfyUI: Modular node graph architecture for batch generation
MIT License - Feel free to use in personal or commercial projects
Issues and PRs are welcome! Please:
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
- Recommended maximum duration: 240 seconds (GPU memory)
- Maximum batch size: Depends on your GPU (start with 1-2)
- SFT models: These models are specific to Supervised Fine-Tuning - not tested with non-SFT models
- Rights and attribution: Respect model and dataset usage rights
Built on the AceStep SFT workflow and extended with modular nodes, advanced guidance, waveform visualization, and quality controls for ComfyUI.
For bugs, questions, or suggestions: open an issue on the repository!
An all-in-one node for ComfyUI that implements AceStep 1.5 SFT (Supervised Fine-Tuning), a state-of-the-art music generation model. It starts from the official AceStep workflow and extends it with stronger conditioning control and practical ComfyUI-oriented quality options.
SFT = Supervised Fine-Tuning: A specialized version of AceStep optimized for generating superior quality audio through supervised training.
This package currently provides four nodes under audio/AceStep SFT:
- AceStep 1.5 SFT Generate: all-in-one generation, editing, and decoding
- AceStep 1.5 SFT Music Analyzer: AI-powered audio analysis (tags, BPM, key/scale)
- AceStep 1.5 SFT Lora Loader: chainable LoRA stack builder for AceStep 1.5 SFT
- AceStep 1.5 SFT Turbo Tag Adapter: rewrites Turbo-oriented tags into shorter SFT-friendly prompt tags
The AceStepSFTGenerate node encapsulates the entire music generation workflow:
1. Latent Creation - Generates initial latents or loads from `latent_or_audio` input
2. Text Encoding - Processes captions, lyrics, and metadata via multiple CLIP encoders
3. Diffusion Sampling - Runs the diffusion model with advanced guidance control
4. Audio Decoding - Converts latents to high-quality audio via VAE
The node supports three classifier-free guidance modes, each with unique characteristics:
- APG (Adaptive Projected Guidance) (Recommended)
  - Dynamic adaptation via momentum buffering
  - Gradient clipping with adaptive thresholds
  - Orthogonal projection to eliminate unwanted noise
  - AceStep SFT default: best balance of quality and stability
- ADG (Angle-based Dynamic Guidance)
  - Angle-based guidance between conditions
  - Operates in velocity space (flow matching)
  - Ideal for aggressive style distortion
  - Adaptive clipping based on angle between x0_cond and x0_uncond
- Standard CFG
  - Traditional Classifier-Free Guidance
  - Simple and predictable implementation
  - Useful as a comparison baseline
- Auto-Duration: Automatically estimates music duration by analyzing lyric structure
- LLM Encoding: Use Qwen LLM (0.6B or 1.7B/4B) to generate semantic audio codes
- Auto Values: BPM, Time Signature, and Key/Scale automatic (model decides)
- Multilingual Support: Over 23 languages supported
- Audio Tag Extraction: Uses the native ACE-Step Transcriber to extract lyric, vocal, and song-structure tags from audio
- BPM Detection: Automatic tempo detection via librosa
- Key/Scale Detection: Detects musical key and scale (e.g. "G minor")
- JSON Output: Structured `music_infos` output with all analysis results
- Generation Parameters: Control temperature, top_p, top_k, repetition_penalty, and seed
- Auto Model Download: Models are downloaded on first use (~1-7 GB each)
| Model | Size | Type | Best For |
|---|---|---|---|
| ACE-Step-Transcriber | 22.4 GB download | Audio-to-Text | Native ACE-Step 1.5 transcription for lyrics, singing voice, structure tags, and instrument hints |
This node is now dedicated to the native ACE-Step-Transcriber workflow. It uses the model's native prompt format, structured transcription output, and derives tags from language, lyrics, section markers such as verse/chorus/bridge, and optional instrument annotations.
- Latent-based Refinement: Use `denoise < 1.0` with `latent_or_audio` connected to refine existing audio
- Accepts AUDIO or LATENT: Connect any audio or latent output for img2img-style editing
- Batch Generation: Generate multiple variations in parallel
- Split Text/Lyric Guidance: Independent `guidance_scale_text` and `guidance_scale_lyric`
- Omega Scale: Mean-preserving output reweighting to approximate AceStep scheduler behavior
- ERG Approximation: Node-local prompt energy reweighting via `erg_scale`
- Guidance Interval Decay: Smoothly decay guidance inside the active interval
- Chainable LoRA Loader: Stack one or more AceStep LoRAs before generation
- Separate strengths: Independent `strength_model` and `strength_clip`
- Single Generate input: Final LoRA stack plugs into the `lora` input on Generate
- Local `Loras/` folder: Drop LoRA files directly into the node's `Loras/` folder; they are automatically registered at startup
- Auto PEFT/DoRA conversion: PEFT-format LoRAs (`adapter_config.json` + `adapter_model.safetensors`) placed in `Loras/` are automatically converted to ComfyUI format on first startup
- DoRA support: Full DoRA (Weight-Decomposed Low-Rank Adaptation) support with automatic `dora_scale` dimension fix for ComfyUI compatibility
- Latent Shift: Additive anti-clipping correction
- Latent Rescale: Multiplicative scaling for dynamic control
Main all-in-one node for text-to-music generation, latent-based audio refinement, and VAE decoding.
AI-powered audio analysis node that extracts descriptive tags, BPM, and key/scale from audio input.
Inputs:
- `audio`: Audio input to analyze
- `model`: AI model selection (9 models, auto-downloaded)
- `get_tags` / `get_bpm` / `get_keyscale`: Enable/disable each analysis
- `max_new_tokens`: Maximum tokens for generative models
- `audio_duration`: Max seconds of audio to analyze
- `temperature`, `top_p`, `top_k`, `repetition_penalty`, `seed`: Generation parameters
- `unload_model`: Free VRAM after analysis
- `use_flash_attn`: Enable Flash Attention 2 (if compatible)
Outputs:
- `tags`: Comma-separated descriptive tags (STRING)
- `bpm`: Detected BPM as string, e.g. "129bpm" (STRING)
- `keyscale`: Key and scale, e.g. "G minor" (STRING)
- `music_infos`: JSON with all results (STRING)
Chainable utility node that builds a LoRA stack for AceStep 1.5 SFT.
Inputs:
- `lora_name`: LoRA file from `ComfyUI/models/loras` or the local `Loras/` folder
- `strength_model`: strength applied to the diffusion model
- `strength_clip`: strength applied to the text encoder stack
- `lora` (optional): upstream AceStep LoRA stack
Output:
- `lora`: connect to another Lora Loader or directly into Generate
| Format | What to place in `Loras/` | Action |
|---|---|---|
| ComfyUI `.safetensors` | Single file | Used directly |
| PEFT/DoRA directory | Folder with `adapter_config.json` + `adapter_model.safetensors` | Auto-converted to `*_comfyui.safetensors` on startup |
| Nested zip artifact | Folder containing a `.safetensors` inside | Auto-extracted to root on startup |
The auto-conversion handles:
- Key remapping: `lora_A`/`lora_B` → `lora_down`/`lora_up`
- DoRA support: `lora_magnitude_vector` → `dora_scale` (with correct 2D shape)
- Per-layer alpha injection from `adapter_config.json` (supports `alpha_pattern` and `rank_pattern`)
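The key remapping step can be sketched as a plain string substitution. This is an illustration only: the helper name `remap_peft_key` and the example key layout are hypothetical, and the real converter also reshapes `dora_scale` and injects per-layer alpha values, which this sketch does not do.

```python
def remap_peft_key(key: str) -> str:
    """Rename PEFT-style LoRA/DoRA key fragments to ComfyUI-style names (hypothetical helper)."""
    replacements = (
        (".lora_A.", ".lora_down."),               # PEFT A matrix -> down projection
        (".lora_B.", ".lora_up."),                 # PEFT B matrix -> up projection
        ("lora_magnitude_vector", "dora_scale"),   # DoRA magnitude vector
    )
    for old, new in replacements:
        key = key.replace(old, new)
    return key


def remap_state_dict_keys(state_dict: dict) -> dict:
    """Apply remap_peft_key to every key, leaving the tensor values untouched."""
    return {remap_peft_key(k): v for k, v in state_dict.items()}
```

For example, an illustrative key such as `layers.0.attn.q_proj.lora_A.weight` becomes `layers.0.attn.q_proj.lora_down.weight`.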
| Parameter | Range | Description |
|---|---|---|
| diffusion_model | - | Path to DiT model (AceStep 1.5 SFT) |
| text_encoder_1 | - | Qwen3 0.6B Encoder (caption processing) |
| text_encoder_2 | - | Qwen3 1.7B/4B Encoder (audio code generation) |
| vae_name | - | AceStep 1.5 VAE |
| caption | - | Text description of music (genre, mood, instruments) |
| lyrics | - | Song lyrics or [Instrumental] |
| instrumental | boolean | Force instrumental mode (overrides lyrics) |
| seed | 0 - 2^64 | Seed for reproducibility |
| steps | 1 - 200 | Diffusion inference steps (default: 50 for ACE-Step 1.5 SFT) |
| cfg | 1.0 - 20.0 | Classifier-free guidance scale (default: 7.0; typical 7.0-9.0 for ACE-Step 1.5) |
| sampler_name | - | Sampler (euler, dpmpp, etc.) |
| scheduler | - | Scheduler (normal, karras, exponential, etc.; default: normal) |
| denoise | 0.0 - 1.0 | Denoising strength (1.0 = fresh generation, < 1.0 = editing) |
| infer_method | ode/sde | ODE keeps the selected sampler behavior; SDE remaps default Euler/Heun choices to a stochastic sampler |
| guidance_mode | apg/adg/standard_cfg | Guidance type (default: apg) |
| duration | 0.0 - 600.0 | Duration in seconds (default: 60.0, 0 = auto) |
| bpm | 0 - 300 | Beats per minute (0 = auto, model decides) |
| timesignature | auto/2/3/4/6 | Time signature numerator |
| language | - | Lyric language (en, ja, zh, es, pt, etc.) |
| keyscale | auto/... | Key and scale (e.g., "C major" or "D minor") |
- batch_size (1-16): Number of audios to generate in parallel
- latent_or_audio: Base input for refinement (img2img). Accepts AUDIO or LATENT. Use `denoise < 1.0` to refine this input. With `duration=0`, duration is derived from the connected input.
- lora: AceStep LoRA stack from one or more `AceStep 1.5 SFT Lora Loader` nodes
- generate_audio_codes (default: True): Enable/disable LLM audio code generation for semantic structure
- lm_cfg_scale (0.0-100.0, default: 2.0): LLM classifier-free guidance scale
- lm_temperature (0.0-2.0, default: 0.85): LLM sampling temperature
- lm_top_p (0.0-2000.0, default: 0.9): Nucleus sampling parameter
- lm_top_k (0-100, default: 0): Top-k sampling
- lm_min_p (0.0-1.0, default: 0.0): Minimum probability threshold
- lm_negative_prompt: Negative prompt for LLM CFG
- latent_shift (-0.2-0.2, default: 0.0): Additive shift (anti-clipping)
- latent_rescale (0.5-1.5, default: 1.0): Multiplicative scaling
- normalize_peak (default: False): Legacy hard normalization to 0 dBFS after VAE decode
- enable_normalization (default: True): Peak-normalize output to a target dBFS level
- normalization_db (-10.0-0.0, default: -1.0): Target peak level when normalization is enabled
- fade_in_duration / fade_out_duration (0.0-10.0, default: 0.0): Optional linear fades after normalization
- use_tiled_vae (default: True): Uses tiled VAE encode/decode for better long-audio and low-VRAM robustness
- voice_boost (-12.0-12.0, default: 0.0): Simple output gain in dB before normalization
- apg_momentum (-1.0-1.0, default: -0.75): Momentum buffer coefficient
- apg_norm_threshold (0.0-10.0, default: 2.5): Norm threshold for gradient clipping
- guidance_interval (-1.0-1.0, default: 0.5): Official centered guidance interval control
- guidance_interval_decay (0.0-1.0, default: 0.0): Linear decay inside the active guidance interval
- min_guidance_scale (0.0-30.0, default: 3.0): Lower bound when interval decay is enabled
- guidance_scale_text (-1.0-30.0, default: -1.0): Text-only guidance scale, `-1` inherits `cfg`
- guidance_scale_lyric (-1.0-30.0, default: -1.0): Lyric-only delta guidance scale, `-1` inherits `cfg`
- erg_scale (-0.9-2.0, default: 0.0): Prompt/lyric conditioning energy reweighting
- cfg_interval_start (0.0-1.0, default: 0.0): Start applying guidance at this schedule fraction
- cfg_interval_end (0.0-1.0, default: 1.0): Stop applying guidance at this schedule fraction
- shift (1.0-5.0, default: 3.0): Schedule shift (3.0 = Gradio default)
- custom_timesteps: Custom comma-separated timesteps (overrides steps, shift, scheduler)
The node automatically manages latent creation or reuse:
├─ If latent_or_audio provided:
│  ├─ AUDIO: Resamples to VAE SR (48kHz), normalizes channels, encodes via VAE
│  ├─ LATENT: Uses directly as latent_image
│  └─ Duration derived from input when duration=0
│
└─ If no latent_or_audio:
   └─ Creates zero latent (pure noise) [batch_size, 64, latent_length]
Automatic Sizing: Duration in seconds is converted to latent length via:
latent_length = max(10, round(duration * vae_sample_rate / 1920))
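As a quick worked example of the sizing formula, the sketch below wraps it in a function. The 48 kHz default comes from the VAE sample rate mentioned above; the function name is chosen for illustration.

```python
def latent_length(duration_s: float, vae_sample_rate: int = 48_000) -> int:
    """Convert a duration in seconds to a latent length, with a floor of 10 frames."""
    # 1920 audio samples map to one latent frame, i.e. 25 frames per second at 48 kHz.
    return max(10, round(duration_s * vae_sample_rate / 1920))
```

At 48 kHz a 60-second request maps to 1500 latent frames and a 240-second request to 6000; very short durations are clamped to the minimum of 10 frames.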
When duration <= 0, the node analyzes lyric structure:
[Intro/Outro] = 8 beats (~2 bars of 4/4)
[Instrumental/Solo] = 16 beats (~4 bars of 4/4)
Verse/Chorus ≈ 2 beats per 2 words (typical singing rate)
Section transitions = 4 beats
Empty lines = 2 beats (pause)
Result: duration = beats * (60 / bpm)
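The heuristic above can be sketched as follows. This is one reading of the rules rather than the node's actual code: it assumes section tags occupy their own line in square brackets, that any tag other than intro/outro/instrumental/solo counts as a 4-beat transition, and that a positive `bpm` has already been chosen (the function name and the 120 BPM default are illustrative).

```python
import re

SECTION_BEATS = {"intro": 8, "outro": 8, "instrumental": 16, "solo": 16}
TRANSITION_BEATS = 4   # any other section marker, e.g. [Verse 1] or [Chorus]
EMPTY_LINE_BEATS = 2   # a blank line is treated as a short pause


def estimate_duration(lyrics: str, bpm: float = 120.0) -> float:
    """Estimate song duration in seconds from lyric structure (illustrative sketch)."""
    beats = 0.0
    for line in lyrics.splitlines():
        text = line.strip()
        if not text:
            beats += EMPTY_LINE_BEATS
            continue
        tag = re.fullmatch(r"\[([^\]]+)\]", text)
        if tag:
            name = tag.group(1).lower()
            fixed = next((b for key, b in SECTION_BEATS.items() if key in name), None)
            beats += fixed if fixed is not None else TRANSITION_BEATS
            continue
        # Sung line: 2 beats per 2 words, i.e. roughly one beat per word.
        beats += len(text.split())
    return beats * 60.0 / bpm
```

For an `[Intro]`, a blank line, a `[Verse 1]` marker, and one four-word line, this gives 8 + 2 + 4 + 4 = 18 beats, or 9 seconds at 120 BPM.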
Metadata (bpm, duration, key/scale, time sig) are encoded in multiple representations:
- Structured YAML (Chain-of-Thought):
bpm: 120
caption: "upbeat electronic dance"
duration: 120
keyscale: "G major"
language: "en"
timesignature: 4
- LLM Template (for audio code generation via Qwen):
<|im_start|>system
# Instruction
Generate audio semantic tokens...
<|im_end|>
<|im_start|>user
# Caption
upbeat electronic dance
# Lyric
[Verse 1]...
<|im_end|>
<|im_start|>assistant
<think>
{YAML above}
</think>
<|im_end|>
- Qwen3-0.6B Template (direct metadata):
# Instruction
# Caption
upbeat electronic dance
# Metas
- bpm: 120
- timesignature: 4
- keyscale: G major
- duration: 120 seconds
<|endoftext|>
# Phase 1: Compute conditional difference
diff = pred_cond - pred_uncond
# Phase 2: Apply smooth momentum
if momentum_buffer:
diff = momentum * running_avg + diff
# Phase 3: Norm clipping
norm = ||diff||₂
scale = min(1, norm_threshold / norm)
diff = diff * scale
# Phase 4: Orthogonal decomposition
diff_parallel = projection of diff onto pred_cond
diff_orthogonal = diff - diff_parallel
# Phase 5: Final guidance
guidance = pred_cond + (cfg_scale - 1) * (diff_orthogonal + eta * diff_parallel)
Why It Works:
- Orthogonal projection removes collinear components that amplify noise
- Momentum smooths large jumps between timesteps
- Adaptive clipping prevents gradient explosion
- Result: cleaner and more stable audio
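The five phases can be written as a small NumPy function. Treat it as a sketch under assumptions rather than the node's implementation: the names, the default momentum of -0.75, and the default threshold of 2.5 are illustrative, and the norm is taken over the whole array instead of per sample.

```python
import numpy as np


class MomentumBuffer:
    """Running average of the guidance difference across steps."""

    def __init__(self, momentum: float = -0.75):
        self.momentum = momentum
        self.running_avg = 0.0

    def update(self, value):
        self.running_avg = value + self.momentum * self.running_avg
        return self.running_avg


def apg_guidance(pred_cond, pred_uncond, cfg_scale,
                 buffer=None, eta=0.0, norm_threshold=2.5):
    # Phase 1: conditional difference
    diff = pred_cond - pred_uncond
    # Phase 2: optional momentum smoothing
    if buffer is not None:
        diff = buffer.update(diff)
    # Phase 3: clip the L2 norm of the difference
    if norm_threshold > 0:
        norm = np.linalg.norm(diff)
        diff = diff * min(1.0, norm_threshold / max(norm, 1e-12))
    # Phase 4: split into components parallel and orthogonal to pred_cond
    unit = pred_cond / max(np.linalg.norm(pred_cond), 1e-12)
    diff_parallel = np.sum(diff * unit) * unit
    diff_orthogonal = diff - diff_parallel
    # Phase 5: final guidance
    return pred_cond + (cfg_scale - 1.0) * (diff_orthogonal + eta * diff_parallel)


cond = np.array([1.0, 0.0])
uncond = np.array([0.0, -1.0])
guided = apg_guidance(cond, uncond, cfg_scale=3.0)
```

With eta=1.0, no momentum buffer, and a threshold large enough that nothing is clipped, the expression reduces to standard CFG, which makes a convenient sanity check.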
# Based on cosine angles between x0_cond and x0_uncond
# Dynamically adjusts guidance based on alignment
# Uses trigonometry for aggressive style deformation
When latent_or_audio is connected with denoise < 1.0, the node operates in img2img mode:
- The input audio is encoded via VAE (or the latent is used directly)
- A fraction of noise is added based on denoise strength
- The diffusion model refines the noisy latent while preserving the original structure
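To make the role of denoise concrete, the sketch below blends a source latent with Gaussian noise using the linear interpolation typical of flow-matching models. It assumes the starting noise level equals denoise, which ignores the effect of the scheduler and shift, and the function name is hypothetical.

```python
import numpy as np


def noise_latent(latent: np.ndarray, denoise: float,
                 rng: np.random.Generator) -> np.ndarray:
    """Blend a latent with noise: 0.0 keeps the source, 1.0 is pure noise."""
    noise = rng.standard_normal(latent.shape)
    return (1.0 - denoise) * latent + denoise * noise


rng = np.random.default_rng(0)
source = np.ones((1, 64, 250))          # [batch_size, 64, latent_length]
partly_noised = noise_latent(source, 0.6, rng)
```

Lower denoise values leave more of the source in the starting latent, which is why 0.5 to 0.7 tends to preserve the character of the input.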
- Use guidance_mode=apg with steps=50 to 64 for best quality
- For img2img refinement, start with denoise=0.5 to 0.7 to preserve the original character
- Mild vocal hiss is usually a generation artifact; APG and slightly higher step counts generally help more than raw cfg
- Simplify overly dense or contradictory tags for cleaner results
| Aspect | APG | ADG | Standard CFG |
|---|---|---|---|
| Quality | ★★★★★ | ★★★★ | ★★★ |
| Stability | ★★★★★ | ★★★★ | ★★ |
| Dynamics | Natural | Aggressive | Predictable |
| Computation | Normal | Normal | Minimal |
| Recommended | ✓ Yes | For extreme styles | Baseline |
AceStepSFTGenerate:
caption: "upbeat electronic dance music with synthesizers"
lyrics: [Instrumental]
instrumental: True
duration: 60.0
cfg: 7.0
steps: 50
sampler_name: "euler"
scheduler: "normal"
guidance_mode: "apg"
→ Generates a strong 60s ACE-Step 1.5 SFT baseline render
AceStepSFTGenerate:
latent_or_audio: (mixer output)
caption: "make it more orchestral"
denoise: 0.7 (preserves 30% of source)
duration: 0 (uses input duration)
→ Refines audio while preserving original characteristics
AceStepSFTGenerate:
batch_size: 4
seed: 42 (varies automatically)
→ Creates 4 variations with similar characteristics
AceStep 1.5 SFT Lora Loader:
lora_name: "Ace-Step1.5/ace-step15-style1.safetensors"
strength_model: 0.7
strength_clip: 0.0
↓
AceStep 1.5 SFT Lora Loader:
lora_name: "Ace-Step1.5/Ace-Step1.5-TechnoRain.safetensors"
strength_model: 0.35
strength_clip: 0.0
↓
AceStep 1.5 SFT Generate:
lora: (stack output)
Note: AceStep LoRAs are now supported directly by this package. If a specific LoRA produces unstable audio, start by lowering strength_model and compare apg against standard_cfg.
AceStepSFTMusicAnalyzer:
audio: (input audio file)
model: "Qwen2-Audio-7B-Instruct"
→ tags: "dancehall beat, powerful bassline, vocal samples, melancholic"
→ bpm: "129bpm"
→ keyscale: "G minor"
↓
AceStepSFTGenerate:
caption: (tags from analyzer)
bpm: 129
keyscale: "G minor"
→ Generates new music matching the analyzed style
Solution: Use negative latent_shift (e.g., -0.1) to reduce amplitude before VAE decoding
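Solution: Tune apg_norm_threshold (e.g., 3.0-4.0). Given scale = min(1, norm_threshold / norm), a lower threshold clips the guidance update more strongly and a higher one clips it less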
Solution:
- Use guidance_mode: "apg" (recommended)
- Start from steps: 50, cfg: 7.0, sampler_name: "euler", scheduler: "normal", infer_method: "ode"
- Keep enable_normalization: True with normalization_db: -1.0 for cleaner final level management
Solution:
- Lower strength_model first, e.g. 0.2 to 0.6
- Set strength_clip to 0.0 unless the LoRA explicitly targets the text encoders
- Compare guidance_mode: "standard_cfg" vs "apg" for that LoRA
- Avoid stacking multiple strong LoRAs at full strength
Cause: DoRA LoRAs store dora_scale as a 1D tensor [N]. ComfyUI's weight_decompose divides it by weight_norm [N,1], which causes PyTorch to broadcast [1,N]/[N,1] → [N,N] instead of the expected [N,1].
Solution: The node fixes this automatically: all dora_scale tensors are unsqueezed to 2D [N,1] at load time. If you still see this error, make sure you are using the latest version of this node.
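The broadcasting behaviour is easy to reproduce. The sketch below uses NumPy, whose broadcasting rules match PyTorch's for this case; the array values are made up for illustration, and in PyTorch the equivalent reshape would be dora_scale.unsqueeze(-1).

```python
import numpy as np

n = 4
dora_scale = np.ones(n)                 # 1D, shape (N,)
weight_norm = np.full((n, 1), 2.0)      # 2D, shape (N, 1)

# A 1D (N,) operand is treated as (1, N), so the result expands to (N, N).
bad = dora_scale / weight_norm

# Reshaping to (N, 1) first keeps the division element-wise per row.
fixed = dora_scale[:, None] / weight_norm

shapes = (bad.shape, fixed.shape)       # ((4, 4), (4, 1))
```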
Solution:
- Place the PEFT folder (containing adapter_config.json + adapter_model.safetensors) inside ComfyUI-AceStep_SFT/Loras/
- Restart ComfyUI; the conversion runs automatically on startup
- Check the console for the [AceStep SFT] Converted PEFT/DoRA → ComfyUI: ... message
- The converted file appears as *_comfyui.safetensors in the dropdown
Solution: Reduce batch_size, lower steps to ~20, or use "karras" scheduler
- AceStep 1.5: ICML 2024 (Learning Universal Features for Efficient Audio Generation)
- Flow Matching: Lipman et al. 2023 (Flow Matching for Generative Modeling)
- APG/ADG: Techniques aligned with official AceStep paper
- ComfyUI: Modular node graph architecture for batch generation
MIT License - Feel free to use in personal or commercial projects
Issues and PRs are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
- Recommended maximum duration: 240 seconds (GPU memory)
- Maximum batch size: Depends on your GPU (start with 1-2)
- SFT models: These models are specific to Supervised Fine-Tuning - not tested with non-SFT models
- Rights and attribution: Respect model and dataset usage rights
Built on the AceStep SFT workflow and extended with advanced guidance and quality controls for ComfyUI.
For bugs, questions, or suggestions: open an issue on the repository! 🎵
