VoxCPM2

Tokenizer-free TTS: voice design, cloning, and 30 languages from 2B params

Price: Open Source · Reviewed: 2026-04-13

Expert verdict

Ship

3 Ships · 1 Skip
The Panel's Take

VoxCPM2 is an open-source text-to-speech system from OpenBMB that takes a fundamentally different architectural approach to speech synthesis. Most modern TTS systems convert speech into discrete tokens first; VoxCPM2 skips tokenization altogether and operates entirely in latent space through a diffusion autoregressive pipeline. The 2B-parameter model was trained on over 2 million hours of multilingual speech and supports 30 languages plus 9 Chinese dialects, with no language tagging needed.

What makes VoxCPM2 stand out is its three-mode voice control system. "Voice Design" creates an entirely new voice from a natural language description alone ("young woman, gentle voice, slightly husky"), with no reference audio required. "Controllable Voice Cloning" takes a reference clip and lets you adjust style and emotion. "Ultimate Cloning" aims for maximum fidelity and requires both the reference audio and its transcript. Output is 48 kHz studio-grade audio, and the model runs at a real-time factor (RTF, synthesis time divided by audio duration) of ~0.3 on an RTX 4090, or ~0.13 with Nano-vLLM acceleration.

The Apache 2.0 license makes VoxCPM2 commercially viable for builders who have been held back by restrictive TTS licensing. It benchmarks competitively with commercial models on Seed-TTS-eval across English and Mandarin. The Hugging Face demo is live, weights are published, and it installs via `pip install voxcpm`. For any developer building voice products, this is worth evaluating immediately.
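The three voice modes above differ mainly in which inputs you supply. The sketch below makes that mapping explicit. It is illustrative only: `VoiceRequest`, `pick_mode`, and the mode strings are hypothetical names invented for this example and are not part of the `voxcpm` package, whose actual API should be checked in the project's own documentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceRequest:
    """Hypothetical request container; NOT part of the voxcpm package."""
    text: str                                    # text to be spoken
    description: Optional[str] = None            # e.g. "young woman, gentle voice"
    reference_audio: Optional[str] = None        # path to a reference clip
    reference_transcript: Optional[str] = None   # transcript of that clip


def pick_mode(req: VoiceRequest) -> str:
    """Map the supplied inputs to one of the three modes named in the review."""
    if req.reference_audio and req.reference_transcript:
        # Reference clip plus its transcript: the highest-fidelity mode.
        return "ultimate_cloning"
    if req.reference_audio:
        # Reference clip only: clone the voice, style stays adjustable.
        return "controllable_cloning"
    if req.description:
        # Text description only: design a new voice, no audio needed.
        return "voice_design"
    raise ValueError("need a voice description or a reference clip")


if __name__ == "__main__":
    req = VoiceRequest(text="Hello there.",
                       description="young woman, gentle voice, slightly husky")
    print(pick_mode(req))  # prints "voice_design"
```

A dispatcher like this is mostly useful in a product layer that sits in front of the model, where users may or may not upload a reference clip.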

Ship · 7.5/10

The reviews

Apache 2.0 + pip install + 48kHz output is the holy grail for voice product builders. Most open TTS models either sound robotic, have restrictive licenses, or require complex setup. VoxCPM2 clears all three bars. The voice design feature alone changes how you prototype voice UX — describe the persona instead of recording it.

RTF of 0.3 on an RTX 4090 means real-time generation requires serious hardware — most small builders can't run this locally at scale. The technical report isn't published yet, so the benchmark claims are harder to independently verify. And 30 languages sounds impressive until you check whether your target dialect is actually well-represented in those 2M training hours.
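To put the RTF figures in concrete terms, here is a small back-of-the-envelope sketch. It assumes the usual definition of real-time factor (synthesis time divided by audio duration) and a deliberately idealized model of one GPU serving requests back to back, with no batching gains or overhead. The function names are made up for this example.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Time needed to generate `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


def max_realtime_streams(rtf: float) -> int:
    """Idealized count of streams one GPU can keep up with in real time.

    Assumes sequential processing with no batching or overhead, so this is
    a rough upper bound, not a capacity measurement.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    # Small epsilon guards against floating-point error at exact ratios.
    return int(1.0 / rtf + 1e-9)


if __name__ == "__main__":
    # RTF values quoted in the review: stock RTX 4090 and Nano-vLLM.
    for label, rtf in [("RTX 4090", 0.3), ("RTX 4090 + Nano-vLLM", 0.13)]:
        secs = round(synthesis_seconds(60.0, rtf), 1)
        print(f"{label}: 60 s of audio in ~{secs} s, "
              f"up to {max_realtime_streams(rtf)} real-time streams")
```

Under these assumptions, a single RTX 4090 keeps pace with roughly 3 simultaneous real-time streams at RTF 0.3 and roughly 7 at RTF 0.13, which is the scaling concern raised here.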

The shift away from discrete tokenization in TTS is architecturally significant: it mirrors the trajectory that diffusion models took in image generation, and look how that ended. VoxCPM2 is an early signal that the tokenize-everything paradigm in audio is starting to crack. The end state is real-time, hyper-expressive voice synthesis running on consumer hardware.

Designing voices with natural language instead of recording sessions is a genuine workflow unlock for content creators and game developers. The ability to describe 'tired, slightly gruff narrator in his 50s' and get consistent output is something I've wanted for years. The 48kHz output quality means it's usable in professional audio contexts without upsampling.
