Stability AI's Audio 3.0 Generates Six-Minute Songs, Runs On-Device
Stability AI has released Stability Audio 3.0, capable of generating up to six-minute songs, alongside a smaller variant that runs on-device and produces two-minute tracks. The release extends the company's push into generative audio as competition in the space intensifies.
Stability AI has launched Stability Audio 3.0, its latest generative audio model, which can produce songs up to six minutes in length, a significant jump from the prior generation limits that have hampered real-world music use cases. A smaller, distilled version of the model is also available and designed to run on-device, trading generation length (capped at two minutes) for the latency and privacy benefits of local inference.
The six-minute ceiling matters because it finally clears the threshold for a full song structure — intro, verse, chorus, bridge, outro — without requiring stitching or awkward looping. Prior AI audio tools, including earlier Stability Audio releases, often topped out around 30 to 90 seconds, making them useful for background loops or jingles but not cohesive musical compositions.
The on-device small model is the more technically interesting development. Running audio generation locally eliminates network round-trip latency, enables offline use, and removes the privacy concerns associated with sending audio prompts and stems to a cloud API. For developers building for mobile, edge, or embedded contexts, this opens use cases that a cloud-only model simply can't serve.
Stability AI has had a turbulent few years, with leadership departures, funding challenges, and questions about its long-term viability, but its model releases have continued at a steady pace. Audio 3.0 enters a market that includes Suno, Udio, and increasingly capable offerings from larger players, and it will need to differentiate on output quality, API access terms, or the on-device story to carve out durable adoption.
Panel Takes
The Builder
Developer Perspective
“The on-device small model is the primitive I actually want — local inference means I can ship it in a mobile app without a cloud dependency or a per-call billing conversation with my PM. What I need to know before I touch it: what's the model format (ONNX, CoreML, something proprietary?), what are the hardware requirements, and is there a real SDK or am I wrapping a binary? If the docs answer those three questions in the first scroll, this is a ship. If I have to dig through a Discord to find out it requires a specific GPU tier, it's a skip regardless of output quality.”
The Skeptic
Reality Check
“Six minutes is a real benchmark, but it means nothing if the output degrades past the 90-second mark — which has been the dirty secret of most long-form audio generation. Suno and Udio have both claimed structural coherence at scale and delivered it inconsistently; I'd want to hear 20 randomly sampled six-minute outputs before calling this solved. The on-device angle is the only genuinely differentiated claim here, and what kills this in 12 months is Apple or Google shipping comparable local audio generation natively in their creative SDKs, which is not a far-fetched scenario given the on-device model trend.”
The Creator
Content & Design
“Six minutes is the first generation length that actually maps to how songs are structured by humans, not by what a model could hold together — that's a meaningful craft threshold, not a marketing number. But the question I care about is whether the output at minute four still sounds intentional or whether it starts to drift into generically competent background music, because that's where every other tool falls apart. There's no public demo gallery called out in the announcement, which means I can't evaluate whether the taste layer is baked into the model or outsourced entirely to the prompt — and that's the whole game for whether this replaces anything in my actual workflow.”
The Founder
Business & Market
“The on-device model is the only defensible wedge here — it's the one thing Suno and Udio can't trivially replicate because it requires a fundamentally different training and distillation investment, not just a UI decision. The buyer for that story is clear: game studios, mobile app developers, and hardware OEMs who can't send user audio context to a cloud and can't absorb per-call API costs at scale. The risk is that Stability's balance sheet doesn't give them enough runway to own that developer relationship before a better-capitalized competitor ships a comparable on-device model and bundles it into an existing SDK.”