Parlor

Full voice + vision AI running locally on your Mac — no cloud needed

Price: Free / Apache 2.0
Reviewed: 2026-04-08

Expert verdict

Ship

3 Ships, 1 Skip
Visit github.com

The Panel's Take

Parlor is an on-device, real-time multimodal AI application that runs an end-to-end audio and video understanding and voice response loop entirely on local hardware: no API keys, no servers, no data leaving the machine. The creator built it to power a free English-learning platform without incurring ongoing server costs.

It captures microphone and camera input, sends them through Gemma 4 E2B via LiteRT-LM on the GPU for comprehension, and returns synthesized speech via Kokoro TTS, all with an end-to-end latency of 2.5 to 3 seconds on an Apple M3 Pro. The stack is deliberately lean: browser-based voice activity detection (VAD), streaming audio output to minimize perceived latency, mid-response interruption support, and a total model download of roughly 2.6 GB. It is written in Python, requires no special setup beyond downloading the models, and is Apache 2.0 licensed.

Parlor surfaced on Hacker News with over 280 points, an unusually strong signal for a one-developer demo project. The reaction reflects a broader shift: multimodal voice AI that required server-grade hardware six months ago now runs on consumer MacBooks, and open-source developers are starting to ship production-ready applications built entirely on that foundation.
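
To make the streaming and interruption ideas concrete, here is a minimal sketch of how a response loop of this kind could be structured. It is not Parlor's code and does not call LiteRT-LM or Kokoro: `sentence_chunks`, `respond`, and the `synthesize`, `play`, and `user_is_speaking` callbacks are hypothetical names standing in for the model token stream, the TTS engine, the audio output, and the VAD signal. The point is only that speech can start after the first complete sentence instead of after the whole reply, and that playback stops once the user starts talking.

```python
import re
from typing import Callable, Iterable, Iterator, List


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Group a stream of text tokens into sentences.

    Yielding per sentence lets speech synthesis begin before the
    language model has finished generating the full reply.
    """
    buf = ""
    for tok in tokens:
        buf += tok
        # Flush whenever the buffer ends in sentence punctuation.
        if re.search(r"[.!?]\s*$", buf):
            yield buf.strip()
            buf = ""
    if buf.strip():
        yield buf.strip()  # trailing text without final punctuation


def respond(
    tokens: Iterable[str],
    synthesize: Callable[[str], object],   # hypothetical TTS stand-in
    play: Callable[[object], None],        # hypothetical audio sink
    user_is_speaking: Callable[[], bool],  # hypothetical VAD signal
) -> List[str]:
    """Speak a reply sentence by sentence; stop if the user interrupts."""
    spoken: List[str] = []
    for sentence in sentence_chunks(tokens):
        if user_is_speaking():
            break  # mid-response interruption: drop the rest of the reply
        play(synthesize(sentence))
        spoken.append(sentence)
    return spoken
```

In a real application the callbacks would wrap the actual model, TTS, and VAD components, and playback would likely run concurrently with generation; this sketch keeps everything sequential for readability.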

Ship · 7.5/10

The reviews

2.5–3 second end-to-end latency for full voice + vision on a MacBook is genuinely remarkable. The architecture is clean — VAD in the browser, LiteRT-LM on GPU for the heavy lifting, Kokoro for TTS. This is a solid foundation for building privacy-first voice assistants, tutors, or accessibility tools without any ongoing API costs.

Three-second latency is still noticeably clunky for natural conversation. OpenAI's and Google's voice APIs run in under a second. On older Macs or non-Apple hardware the latency will be worse. It's a proof of concept, not a daily driver, and the model quality gap between Gemma 4 E2B and GPT-4o voice is real.

The trajectory here is the story. If M3 Pro hits 3 seconds today, M5 will hit under 1 second in 18 months. Every capability improvement in edge chips directly translates to closed-loop multimodal AI as a baseline feature of devices. Parlor is one of the first working demos of where all consumer devices are headed.
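
As a rough check on what that projection implies, the short calculation below works out the improvement rate needed to go from 3 seconds to 1 second in 18 months. It assumes, purely for illustration, that latency falls in direct proportion to overall hardware and software speedup on the same workload; `implied_doubling_months` is a hypothetical helper name, not something from the project.

```python
import math


def implied_doubling_months(latency_now_s: float,
                            latency_target_s: float,
                            horizon_months: float) -> float:
    """Months per 2x speedup needed to hit the target latency in time.

    Assumes latency scales inversely with effective throughput.
    """
    speedup = latency_now_s / latency_target_s
    return horizon_months / math.log2(speedup)


months = implied_doubling_months(3.0, 1.0, 18)
print(f"Required: one 2x speedup every {months:.1f} months")
```

Under that assumption, a 3x reduction in 18 months requires effective performance to double roughly every 11 months, and faster than that to land below 1 second.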

For language tutoring, creative storytelling tools, or interactive audio-visual demos, having no cloud dependency means total privacy for learners and zero recurring costs for creators. The English-learning use case the creator shipped it for is exactly the kind of high-impact low-resource application this technology should be enabling.
