Question 1

Which is better: Qwen3.5-Omni or VoxCPM2?

Accepted Answer

Both models received a panel verdict of Ship. Based on our expert panel, Qwen3.5-Omni has the stronger verdict, with a 75% Ship rate.

Question 2

Is Qwen3.5-Omni free?

Accepted Answer

Qwen3.5-Omni pricing: Proprietary / API (Alibaba Cloud)

Question 3

Is VoxCPM2 free?

Accepted Answer

VoxCPM2 pricing: Free / Open Source

Question 4

What do experts say about Qwen3.5-Omni vs VoxCPM2?

Accepted Answer

Qwen3.5-Omni: Qwen3.5-Omni is Alibaba's most advanced multimodal model yet — a native Thinker-Talker architecture that processes and generates text, audio, and video in a single unified system. Released in three variants (Plus, Flash, Light), it supports a 256k context window, 10+ hours of audio, and 400 seconds of 720p video at 1 FPS, with speech recognition across 113 languages and dialects.
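To put the stated video limit in concrete terms, here is a rough back-of-the-envelope sketch. The 400 seconds and 1 FPS figures come from the description above; the 1280x720 resolution for "720p" and the helper name `frame_budget` are assumptions made for illustration, not anything documented by Alibaba.

```python
# Rough frame budget for the stated video limit (illustrative only).
# From the text: up to 400 seconds of 720p video, sampled at 1 FPS.
# Assumption: "720p" means 1280x720 pixels.

def frame_budget(duration_s: float, fps: float) -> int:
    """Number of frames sampled from a clip of the given duration."""
    return int(duration_s * fps)

MAX_VIDEO_SECONDS = 400
SAMPLE_FPS = 1
WIDTH, HEIGHT = 1280, 720  # assumed 720p resolution

frames = frame_budget(MAX_VIDEO_SECONDS, SAMPLE_FPS)
pixels_per_frame = WIDTH * HEIGHT

print(frames)            # 400
print(pixels_per_frame)  # 921600
```

In other words, a maximum-length clip at that sampling rate amounts to about 400 still frames, which is why the limit is quoted in seconds rather than in file size.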

The headline capability is what Alibaba is calling "Audio-Visual Vibe Coding" — an emergent behavior where the model writes functional code based solely on watching a video and listening to spoken instructions. In demos, it takes a hand-drawn sketch held up to a camera and converts it into a working React webpage in real time. This wasn't an explicitly trained capability; it emerged from the model's unified multimodal architecture.

The model uses semantic interruption and turn-taking intent recognition for real-time interaction, and TMRoPE for temporal multimodal position encoding. The catch: Alibaba broke from its open-source streak and kept Qwen3.5-Omni proprietary, accessible only through their chatbot interface and Alibaba Cloud. The open-source community has noticed, and is not pleased.

VoxCPM2: VoxCPM2 is a 2-billion-parameter text-to-speech model from OpenBMB that scraps discrete tokenization entirely, working directly in continuous latent space via a diffusion autoregressive architecture. Unlike dominant TTS approaches (VALL-E, Tortoise, XTTS), it never converts audio to discrete tokens; diffusion handles the full generation pipeline, resulting in 48kHz studio-quality output.
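For a sense of what VoxCPM2's 48kHz output means in practice, the sketch below estimates the size of uncompressed audio at that sample rate. Only the 48 kHz figure comes from the description; the 16-bit depth, mono channel count, and the helper name `pcm_size_bytes` are assumptions for illustration and say nothing about the model's actual output format.

```python
# Uncompressed PCM size at a 48 kHz sample rate (illustrative only).
SAMPLE_RATE_HZ = 48_000  # from the text
BYTES_PER_SAMPLE = 2     # assumption: 16-bit PCM
CHANNELS = 1             # assumption: mono

def pcm_size_bytes(seconds: float) -> int:
    """Bytes needed to store `seconds` of raw PCM audio."""
    samples = int(seconds * SAMPLE_RATE_HZ)
    return samples * BYTES_PER_SAMPLE * CHANNELS

print(pcm_size_bytes(10))  # 960000
```

Under those assumptions, ten seconds of generated speech is a little under 1 MB before any compression.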

It supports 30 languages without requiring language tags, zero-shot voice cloning from reference audio, and — most distinctly — voice design from pure natural-language descriptions. You can prompt "a warm, slightly raspy woman in her 40s who sounds like a news anchor" and get a consistent new voice without providing any reference audio. Trained on 2M+ hours of multilingual data.

Released under Apache 2.0, making it commercially usable. The architecture diverges meaningfully from existing open-source TTS options and introduces a novel UX primitive (describe a voice, get a voice) that could reshape how developers approach voice synthesis in products.
