Question 1

Which is better: ACE-Step 1.5 XL or Voicebox?

Accepted Answer

Our expert panel gave both ACE-Step 1.5 XL and Voicebox a verdict of Ship. ACE-Step 1.5 XL has the stronger verdict, with a 100% Ship rate.

Question 2

Is ACE-Step 1.5 XL free?

Accepted Answer

Yes. ACE-Step 1.5 XL is free and open source.

Question 3

Is Voicebox free?

Accepted Answer

Yes. Voicebox is free and open source.

Question 4

What do experts say about ACE-Step 1.5 XL vs Voicebox?

Accepted Answer

ACE-Step 1.5 XL: ACE-Step 1.5 XL is an open-source music generation foundation model jointly developed by ACE Studio and StepFun. Released April 2, 2026, the XL variant adds a 4-billion-parameter Diffusion Transformer decoder for significantly higher audio quality over the base model, available in three variants: xl-base, xl-sft, and xl-turbo.

The architecture pairs a Language Model (which acts as a planner, transforming user prompts into song blueprints with metadata, lyrics, and captions) with a Diffusion Transformer that generates the actual audio. Speed is a headline feature: a full song generates in under 2 seconds on an A100 and under 10 seconds on an RTX 3090, and the model runs with less than 4 GB of VRAM. It supports LoRA personalization from just a handful of reference songs, which lowers the barrier to custom style training.

ACE-Step supports full song generation with lyrics, instruments, multiple genres, and multi-track control. The model runs locally on Mac (Apple Silicon), AMD, Intel, and CUDA devices. Community-built UIs like ace-step-ui give non-technical users a polished interface. This is now widely regarded as the best open-source music generation option available, outperforming most commercial alternatives at zero cost.

Voicebox: Voicebox is an open-source, local-first voice synthesis studio that bundles seven TTS engines (including Qwen3-TTS, LuxTTS, and Kokoro) into a single desktop app with a podcast-style multi-track timeline editor. Everything runs on-device across macOS, Windows, and Linux, with zero data leaving your machine.

Beyond basic TTS, it supports zero-shot voice cloning from a short reference clip, 23 languages, 50+ preset voices, and post-processing audio effects (reverb, noise reduction, EQ). A REST API ships alongside the GUI, so developers can integrate it into pipelines without leaving the local paradigm.

With over 20k GitHub stars and trending this week, Voicebox positions itself as a fully local ElevenLabs alternative: not just a one-off TTS wrapper but a genuine production tool. The multi-engine approach means you can route different speakers in a conversation to different models based on quality/speed tradeoffs.
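
The per-speaker routing idea can be illustrated with a minimal Python sketch. It does not use Voicebox's actual API. The engine names are taken from the description above, but the `Engine` class, the `pick_engine` and `route` helpers, and especially the quality/speed scores are hypothetical placeholders, not measured benchmarks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Engine:
    name: str
    quality: int  # hypothetical 1-5 score, NOT a measured benchmark
    speed: int    # hypothetical 1-5 score, NOT a measured benchmark


# Engine names come from the text; the scores are illustrative placeholders.
ENGINES = [
    Engine("Qwen3-TTS", quality=5, speed=2),
    Engine("LuxTTS", quality=4, speed=3),
    Engine("Kokoro", quality=3, speed=5),
]


def pick_engine(priority: str) -> Engine:
    """Return the engine with the highest placeholder score for a priority."""
    if priority not in ("quality", "speed"):
        raise ValueError(f"unknown priority: {priority}")
    return max(ENGINES, key=lambda engine: getattr(engine, priority))


def route(speakers: dict[str, str]) -> dict[str, str]:
    """Map each speaker name to an engine name based on that speaker's priority."""
    return {
        speaker: pick_engine(priority).name
        for speaker, priority in speakers.items()
    }


if __name__ == "__main__":
    # A lead voice favors quality; a background voice favors speed.
    print(route({"host": "quality", "guest": "speed"}))
```

In practice, you would replace the placeholder scores with your own listening tests and timing measurements on your hardware before settling on a routing table.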
