ACE-Step 1.5: Open-Source Music AI That Runs Locally and Matches Suno v5 in Benchmarks

The AI music generation space has been dominated by cloud-based services (Suno, Udio, Mureka), each built around a subscription model and offering limited control over the underlying models. ACE-Step 1.5, released in 2026 by ACE Studio and StepFun, changes that. It is an open-source music foundation model under the MIT license. In the benchmark figures reported below, the base 2B model matches or exceeds Suno v4.5 and the XL variant scores at or above Suno v5. It generates a full song in under 2 seconds on an A100 and runs on consumer hardware with as little as 4 GB of VRAM.

Architecture: Language Model as Composer, Diffusion Transformer as Producer

ACE-Step 1.5 uses a hybrid architecture that separates planning from generation. A Language Model (LM) — available in 0.6B, 1.7B, and 4B parameter variants based on Qwen3 — acts as an omni-capable planner. It transforms simple text prompts into comprehensive song blueprints: metadata (BPM, key, time signature), lyrics, and detailed audio captions. These blueprints guide a Diffusion Transformer (DiT) that generates the actual audio waveform.
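The planner/producer split can be pictured as two stages joined by a structured blueprint. The sketch below is only an illustration of that data flow: `SongBlueprint`, `plan`, and `render` are hypothetical names, not part of the ACE-Step codebase, and both stages are stubs that return fixed values instead of calling a model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SongBlueprint:
    """Hypothetical container for the planner's output (field names are illustrative)."""
    bpm: int
    key: str
    time_signature: str
    lyrics: str
    caption: str


def plan(prompt: str) -> SongBlueprint:
    """Stand-in for the LM planner: returns a fixed blueprint for illustration."""
    return SongBlueprint(
        bpm=120,
        key="C major",
        time_signature="4/4",
        lyrics="[verse]\n(placeholder lyrics)",
        caption=f"Expanded description of: {prompt}",
    )


def render(blueprint: SongBlueprint, duration_s: float) -> dict:
    """Stand-in for the DiT: reports what it was conditioned on instead of audio."""
    return {
        "duration_s": duration_s,
        "conditioning": (blueprint.bpm, blueprint.key, blueprint.time_signature),
        "has_lyrics": bool(blueprint.lyrics),
    }


blueprint = plan("upbeat indie rock song about road trips")
result = render(blueprint, duration_s=180.0)
print(result)
```

The point of the intermediate object is that the generation stage never sees the raw prompt, only the expanded plan.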

The critical innovation is how the LM and DiT are aligned. Rather than relying on external reward models or human preference data, ACE-Step uses intrinsic reinforcement learning: the training signal comes from the model’s own internal mechanisms. This avoids the biases that external scoring systems can introduce and is intended to produce more consistent outputs across diverse genres.

The LM also performs Chain-of-Thought reasoning to expand simple prompts. A description like “upbeat indie rock song about road trips” becomes a full production specification with genre tags, instrumentation, vocal style, and structural markers. Users do not need music production expertise to get professional-sounding results.

Performance: 2 Seconds Per Song on A100

Generation speed is where ACE-Step 1.5 stands apart from virtually every competitor. A complete 3–4 minute song generates in under 2 seconds on an NVIDIA A100 and under 10 seconds on an RTX 3090. Most competing models — both open-source and commercial — take 20 seconds to several minutes for the same task.
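To put those figures in perspective, it helps to express them as a real-time factor: seconds of audio produced per second of compute. The helper below is a small worked calculation using only the numbers quoted above; the function name is made up for this example. Because the quoted times are upper bounds ("under 2 seconds", "under 10 seconds"), the resulting factors are lower bounds.

```python
def realtime_factor(song_seconds: float, generation_seconds: float) -> float:
    """Seconds of audio produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return song_seconds / generation_seconds


# A 3-4 minute song is 180-240 seconds of audio.
a100_low = realtime_factor(180, 2.0)      # 90.0
a100_high = realtime_factor(240, 2.0)     # 120.0
rtx3090_low = realtime_factor(180, 10.0)  # 18.0
rtx3090_high = realtime_factor(240, 10.0) # 24.0

print(f"A100:     at least {a100_low:.0f}x to {a100_high:.0f}x real time")
print(f"RTX 3090: at least {rtx3090_low:.0f}x to {rtx3090_high:.0f}x real time")
```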

The turbo mode (8 diffusion steps) prioritizes speed with slightly reduced quality. The sft mode (50 steps) delivers the best audio fidelity. Batch generation supports up to 8 songs simultaneously. The model handles durations from 10 seconds to 10 minutes (600 seconds), giving flexibility for short loops, full tracks, or extended compositions.
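The limits in this paragraph can be collected into a small request check. This is a sketch under assumptions: the constant and function names are hypothetical and do not mirror ACE-Step's actual configuration keys; only the numeric limits (8 and 50 steps, 10 to 600 seconds, batches of up to 8) come from the text above.

```python
# Step counts per mode, as quoted in the article.
MODE_STEPS = {"turbo": 8, "sft": 50}
MIN_DURATION_S = 10
MAX_DURATION_S = 600
MAX_BATCH = 8


def validate_request(mode: str, duration_s: float, batch_size: int = 1) -> int:
    """Check a request against the quoted limits and return the diffusion step count."""
    if mode not in MODE_STEPS:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODE_STEPS)}")
    if not MIN_DURATION_S <= duration_s <= MAX_DURATION_S:
        raise ValueError(f"duration must be {MIN_DURATION_S}-{MAX_DURATION_S} seconds")
    if not 1 <= batch_size <= MAX_BATCH:
        raise ValueError(f"batch size must be 1-{MAX_BATCH}")
    return MODE_STEPS[mode]


steps = validate_request("turbo", duration_s=210, batch_size=4)
step_ratio = MODE_STEPS["sft"] / MODE_STEPS["turbo"]  # 6.25
print(steps, step_ratio)
```

The sft mode runs 6.25 times as many diffusion steps as turbo; how that translates into wall-clock time depends on the hardware and is not something this sketch measures.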

Hardware Requirements: 4 GB VRAM Minimum

The base 2B DiT model runs with less than 4 GB of VRAM using INT8 quantization and full CPU offloading. This puts it within reach of mid-range GPUs like the RTX 3050 or even some integrated solutions. The XL series, the higher-quality variant with a 4B-parameter DiT decoder, requires 12 GB of VRAM with offloading or 20 GB without.
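Those thresholds can be read as a simple decision rule. The function below is a rough sketch, not an official sizing tool: the name `suggest_setup` and the returned labels are invented for this example, and the 4 GB cutoff follows the heading above rather than any measured minimum.

```python
def suggest_setup(vram_gb: float) -> str:
    """Map available VRAM to the largest configuration the article says it can run."""
    if vram_gb < 0:
        raise ValueError("vram_gb cannot be negative")
    if vram_gb >= 20:
        return "XL (4B DiT), no offload"
    if vram_gb >= 12:
        return "XL (4B DiT), with offload"
    if vram_gb >= 4:
        return "base (2B DiT), INT8 + CPU offload at the low end"
    return "CPU-only mode or a hosted option"


for vram in (2, 4, 8, 12, 16, 24):
    print(f"{vram:>2} GB -> {suggest_setup(vram)}")
```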

Platform support is broad: NVIDIA CUDA, AMD ROCm (Linux and Windows), Apple Silicon via MLX (M1/M2/M3/M4), and Intel XPU. CPU-only mode works but with significantly reduced speed and quality. Portable packages are available for Windows and macOS, along with Docker images for cloud GPU deployments on RunPod and similar platforms.

Benchmark Quality: How It Compares to Suno v5

ACE-Step 1.5-XL (4B DiT) scores at or above Suno v5 levels on standard music evaluation metrics. On a 1–5 scale: musicality 4.79 vs 4.72 (Suno v5), vocal naturalness 4.65 vs 4.56, style adherence 4.78 vs 4.71, lyric adherence 4.72 vs 4.63. The standard 2B model sits between Suno v4.5 and Suno v5 on musicality (4.67) and scores slightly above Suno v5 on the other three metrics.

Model	Musicality	Vocal naturalness	Style adherence	Lyric adherence	Cost
ACE-Step 1.5-XL (4B)	4.79	4.65	4.78	4.72	Free (MIT)
Suno v5	4.72	4.56	4.71	4.63	$8–24/mo
Suno v4.5	4.64	4.49	4.63	4.53	$8–24/mo
ACE-Step 1.5 (2B)	4.67	4.59	4.72	4.66	Free (MIT)
Mureka V8	4.46	4.41	4.52	4.48	Subscription
HeartMuLa	4.55	4.45	4.69	4.55	Open-source
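One way to read the table at a glance is to average the four scores per model and sort. The snippet below does that with the figures copied from the table. The unweighted mean is an assumption made for this example, not a metric from the benchmark itself, and it treats all four dimensions as equally important.

```python
# Scores copied from the table above (1-5 scale):
# (musicality, vocal naturalness, style adherence, lyric adherence)
SCORES = {
    "ACE-Step 1.5-XL (4B)": (4.79, 4.65, 4.78, 4.72),
    "Suno v5": (4.72, 4.56, 4.71, 4.63),
    "Suno v4.5": (4.64, 4.49, 4.63, 4.53),
    "ACE-Step 1.5 (2B)": (4.67, 4.59, 4.72, 4.66),
    "Mureka V8": (4.46, 4.41, 4.52, 4.48),
    "HeartMuLa": (4.55, 4.45, 4.69, 4.55),
}


def mean_score(values: tuple) -> float:
    """Unweighted mean of the four metric scores."""
    return sum(values) / len(values)


# Highest average first.
ranking = sorted(SCORES, key=lambda model: mean_score(SCORES[model]), reverse=True)

for model in ranking:
    print(f"{model:<22} {mean_score(SCORES[model]):.2f}")
```

On this simple average the two ACE-Step models come out ahead of Suno v5, though the 2B model's lead over Suno v5 is only about 0.005 points, well within what a different weighting could reverse.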

Suno v5.5 — the latest commercial release — remains the quality benchmark that open-source has not fully reached. But the gap has narrowed considerably, and ACE-Step’s advantages in cost, privacy, and customization offset the marginal quality difference for most use cases.

LoRA Personalization: Your Own Sound in One Hour

One of ACE-Step 1.5’s most powerful features is LoRA training on custom audio. With as few as 8 songs and approximately 1 hour of training on an RTX 3090, with about 12 GB of VRAM in use, users can fine-tune the model to capture a specific vocal style, genre, or instrumental character. The Gradio UI includes a dedicated tab for one-click data annotation and LoRA training.
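Part of why this fits in an hour on a single consumer card is the way LoRA works in general: instead of updating a full weight matrix, it trains two thin matrices whose product approximates the update. The arithmetic below shows the saving for one layer. The layer size and rank are illustrative values chosen for this example, not ACE-Step's actual dimensions.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when a d_out x d_in weight matrix is updated directly."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a rank-r LoRA pair: B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Illustrative sizes only (assumed, not taken from the model).
d_out = d_in = 1024
rank = 16

full = full_update_params(d_out, d_in)   # 1,048,576
lora = lora_params(d_out, d_in, rank)    # 32,768
print(f"full update: {full:,} params")
print(f"LoRA update: {lora:,} params ({lora / full:.3%} of full)")
```

For these sizes the adapter trains 32 times fewer parameters than a full update of the same layer.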

This enables scenarios that cloud services cannot offer: training on your own vocals, emulating a specific artist’s style (responsibly), or creating a signature sound for a music project. Community tools like Side-Step extend training capabilities with LoKR adapters, corrected timestep sampling, and gradient sensitivity analysis.

Beyond Text-to-Music: Covers, Repaints, and Stem Separation

ACE-Step 1.5 supports multiple generation modes. Cover Generation creates new versions of existing audio with different styles or vocals. Repaint regenerates selected portions while preserving the rest. Vocal2BGM auto-generates accompaniment for a cappella tracks. Track Separation splits audio into individual stems (vocals, drums, bass, other).

The Multi-Track mode layers new elements onto existing compositions — similar to Suno Studio’s Add Layer feature. Extract mode pulls individual elements from mixed audio. Complete mode finishes incomplete compositions. Audio Understanding automatically extracts BPM, key, time signature, and generates captions from any audio input. All functions run locally without generation limits.

Ecosystem: VST3 Plugin, ComfyUI Nodes, and Alternative Studios

The ACE-Step ecosystem has grown rapidly. acestep.vst3 is an official VST3 plugin (C++17/GGML) for DAW integration — works with Ableton Live, FL Studio, Logic Pro, and runs on CPU, CUDA, Metal, or Vulkan. acestep.cpp is a portable C++ implementation with a built-in HTTP server and Svelte web UI.

Alternative frontends include ace-step-ui (Spotify-inspired), ace-step-studio (Suno-style workflow), Tadpole Studio (AI DJ, radio, playlists, 11 themes), and Majik’s Music Studio (native macOS/Linux app with full MLX acceleration). For ComfyUI, multiple node packs support generation, covers, LoRA training, and streaming workflows.

Generative Radio creates a continuous AI radio stream where Qwen3 generates prompts and ACE-Step generates songs back-to-back. DEMON provides streaming diffusion with TensorRT acceleration and hot-mutable controls. For those without capable hardware, acemusic.ai offers free browser-based generation — no GPU required.
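The Generative Radio idea boils down to a loop: ask a prompt source for the next idea, hand it to the music generator, emit the track, repeat. The sketch below shows only that control flow. It is not the project's code: `radio_stream` and both stub functions are hypothetical, and the stubs return placeholder bytes instead of calling Qwen3 or ACE-Step.

```python
from itertools import count, islice
from typing import Callable, Iterator, Tuple


def radio_stream(
    next_prompt: Callable[[], str],
    generate_song: Callable[[str], bytes],
) -> Iterator[Tuple[str, bytes]]:
    """Endless stream of (prompt, audio) pairs, produced one track at a time."""
    while True:
        prompt = next_prompt()
        yield prompt, generate_song(prompt)


# Stubs standing in for the prompt-writing LM and the music model.
_track_numbers = count(1)


def stub_prompt() -> str:
    return f"track {next(_track_numbers)}: mellow synthwave"


def stub_generate(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # placeholder for real audio data


# Take the first three tracks from the otherwise infinite stream.
first_three = list(islice(radio_stream(stub_prompt, stub_generate), 3))
for prompt, audio in first_three:
    print(prompt, len(audio))
```

Writing the stream as a generator keeps it lazy: nothing is generated until a listener asks for the next track.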

Practical Access: Telegram Bot for Personalized Songs

For users who want AI-generated music without local setup or technical configuration, the Telegram bot @singingcard2025_bot creates personalized musical compositions directly in chat. Users describe the occasion, recipient, and mood — the bot generates a complete song with lyrics, melody, and vocals in 1–3 minutes.

The bot supports 10 languages for the interface (English, Russian, Ukrainian, Spanish, Portuguese, Hindi, Indonesian, Vietnamese, French, Italian) and generates songs in any language. Common use cases: birthday songs, anniversary gifts, wedding music, or just a personalized track for someone special. New users receive 20 free credits (enough for 1–2 songs), with additional credits available via Telegram Stars from 50 Stars (~$1).

ACE-Step 1.5 vs Suno: Which to Choose

Suno offers cloud convenience with models up to v5.5. Free tier: 10 songs/day on v4.5. Pro ($8/mo): 500 songs/month. Premier ($24/mo): 2,000 songs/month with Suno Studio access. Commercial rights on paid plans only. No model personalization. No local deployment.

ACE-Step 1.5 runs entirely locally under the MIT license. No subscriptions, no generation limits, full commercial rights, LoRA personalization, DAW integration, offline operation. Requires a GPU with 4+ GB VRAM.
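The subscription prices above translate into a per-song cost that depends heavily on how much of the quota is actually used. The calculation below uses only the plan prices and quotas quoted for Suno; the function name is made up for this example, and the comparison leaves out the hardware and electricity costs of running ACE-Step locally.

```python
# (monthly price in USD, songs per month), as quoted above.
PLANS = {"Pro": (8.0, 500), "Premier": (24.0, 2000)}


def cost_per_song(monthly_usd: float, songs_made: int) -> float:
    """Effective cost per song for a given monthly price and number of songs made."""
    if songs_made <= 0:
        raise ValueError("songs_made must be positive")
    return monthly_usd / songs_made


# Best case: the full quota is used every month.
pro_at_quota = cost_per_song(*PLANS["Pro"])          # about $0.016
premier_at_quota = cost_per_song(*PLANS["Premier"])  # about $0.012

# Light use: four songs a month on the Pro plan.
pro_light_use = cost_per_song(PLANS["Pro"][0], 4)    # $2.00

print(f"Pro at quota:     ${pro_at_quota:.3f} per song")
print(f"Premier at quota: ${premier_at_quota:.3f} per song")
print(f"Pro, 4 songs/mo:  ${pro_light_use:.2f} per song")
```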

For occasional use — one song per week for a social media post — Suno’s browser interface is more convenient. For regular production, privacy-sensitive work, commercial projects without subscription lock-in, or custom model training — ACE-Step 1.5 is the stronger choice. For zero-setup access, the Telegram bot or acemusic.ai bridge the gap.

Bottom Line

ACE-Step 1.5 is the first genuinely competitive open-source entry in AI music generation. Its hybrid LM + DiT architecture produces commercial-grade songs in seconds on consumer hardware, supports 50+ languages, and enables personalization through LoRA training on just 8 tracks. The MIT license permits commercial use; the main obligation is to keep the copyright and license notice with redistributed copies.

For music producers, content creators, developers building audio features, or anyone curious about AI music — the barrier to entry has dropped from a monthly subscription to a git clone and a GPU. Repository: github.com/ace-step/ACE-Step-1.5. Community ecosystem: awesome-ace-step.
