AI tool comparison
Google Gemma 4 vs Ternary Bonsai
Which one should you ship with? Here is the side-by-side panel verdict, pricing read, reviewer split, and community vote comparison.
Open Source Models
Google Gemma 4
Google's first Apache 2.0 open model family with native multimodal input
Panel ship: 75%
Community: n/a
Entry: Free
Gemma 4 is Google's newest open model family, available in E2B, E4B, 26B, and 31B sizes and built on the Gemini 3 architecture. For the first time, Google has released Gemma under Apache 2.0, making the models fully commercial-friendly with no Google-specific use restrictions. Every model in the family is natively multimodal from training: text, image, video, and audio inputs are all first-class. Context windows run from 128K to 256K tokens depending on size, and the models include built-in function calling, structured JSON output, and agentic workflow support. The E2B and E4B variants target on-device mobile and laptop deployment, with native audio understanding designed for always-on assistant scenarios. NVIDIA has already published optimized Gemma 4 containers for RTX hardware. The Apache 2.0 license removes a major adoption barrier that held back Gemma 3 in commercial products. Gemma 4 reached #1 on Hacker News with 1,400+ points, an immediate and enthusiastic reaction from the open-source model community.
Open Source Models
Ternary Bonsai
1.58-bit LLMs that run at 82 tok/s on M4 Pro and on your iPhone
Panel ship: 75%
Community: n/a
Entry: Free
PrismML's Ternary Bonsai is a family of aggressively quantized language models that take the BitNet concept to its logical extreme. Each weight is constrained to one of three values, {-1, 0, +1}, with a shared FP16 scale factor per 128-weight group. There are no higher-precision escape hatches and no hybrid layers. The result is a roughly 9x reduction in memory footprint versus standard 16-bit models. The numbers are striking: the 8B model fits in 1.75 GB and hits 82 tokens per second on an M4 Pro. More impressively, it runs at 27 tokens per second on an iPhone 17 Pro Max, fast enough for real-time conversation on-device. The 8B variant averages 75.5 across standard benchmarks, outperforming many models that are 9-10x larger. The 4B and 1.7B variants push further into mobile-optimized territory. All three models are released under the Apache 2.0 license, available on Hugging Face and GitHub, and integrated into the Locally AI iOS app for immediate on-device deployment. For developers building privacy-sensitive applications, or anyone tired of paying cloud inference costs, Ternary Bonsai offers a compelling on-device alternative that doesn't require a beefy GPU.
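To make the ternary scheme and the memory claim concrete, here is a minimal Python sketch. It is not PrismML's code. The group size of 128 and the FP16 scale come from the description above; everything else is an assumption for illustration. In particular, the scale is taken as the mean absolute value of the group and the codes come from round-and-clip, a recipe borrowed from BitNet b1.58's absmean quantization. The function names are hypothetical, and the footprint estimate assumes ideal packing at log2(3) bits per weight and exactly 8e9 parameters.

```python
import math

GROUP_SIZE = 128  # weights sharing one scale, per the description above


def quantize_group(weights):
    """Map one group of floats to ternary codes plus one shared scale.

    Assumption: scale = mean absolute value, codes = round then clip to
    [-1, 1]. The real Ternary Bonsai recipe may differ.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    if scale == 0.0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Reconstruct approximate weights: each is -scale, 0, or +scale."""
    return [c * scale for c in codes]


def bits_per_weight(group_size=GROUP_SIZE, scale_bits=16):
    """Ideal ternary code (log2(3) bits) plus the amortized FP16 scale."""
    return math.log2(3) + scale_bits / group_size


def footprint_gb(n_params, group_size=GROUP_SIZE):
    """Estimated weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight(group_size) / 8 / 1e9


if __name__ == "__main__":
    codes, scale = quantize_group([0.9, -1.1, 0.05, 0.0])
    print(codes, scale)
    print(round(bits_per_weight(), 2), "bits per weight")
    print(round(footprint_gb(8e9), 2), "GB for 8e9 parameters")
    print(round(16 / bits_per_weight(), 1), "x smaller than FP16")
```

Under these assumptions the estimate comes to about 1.71 bits per weight and about 1.71 GB for 8e9 parameters, or roughly 9.4x smaller than FP16. That is in line with the quoted 1.75 GB and 9x figures; the small gap may reflect the exact parameter count or packing overhead, which the description does not specify.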
Reviewer scorecard
“Apache 2.0 means I can embed it in commercial products without legal review overhead. Native audio + 256K context on a 26B model that runs on a single A100 is a killer combo for production agent work. This is the open model I've been waiting for.”
“82 tokens per second on M4 Pro in 1.75 GB is a genuinely impressive engineering achievement. For local tooling, code assistants, or any latency-sensitive workload where I don't want cloud round-trips, this hits a sweet spot that larger quantized models miss. Apache 2.0 means I can embed it in commercial apps without legal headaches.”
“Google has a history of releasing models and then quietly deprioritizing them once the PR cycle ends. Gemma 1 and 2 both got less maintenance than promised. The Apache license is great news, but trust has to be earned over time with consistent model updates.”
“A 75.5 benchmark average sounds good until you compare it against 8B models quantized with GGUF Q8 — which score similarly and have years of tooling, community support, and production deployments behind them. The 9x memory savings matter on constrained devices but less so on any machine with 16GB+ RAM. Niche but real use case.”
“Native multimodal understanding — including audio — on models small enough for phones changes what ambient computing looks like. Gemma 4 on-device could be the model layer for a generation of always-on smart devices that don't need cloud inference.”
“On-device AI at 27 tokens per second on a phone is the inflection point that makes LLMs a platform primitive rather than a cloud service. Once inference is this cheap and fast on commodity hardware, the entire economic model of AI-as-API-call collapses. Ternary quantization is an early signal of where efficiency research is heading.”
“Image, video, and audio in one open model I can run locally? The creative tooling possibilities are enormous. I can build private multimodal workflows for client work without data leaving my machine. Apache 2.0 seals it — this is a Ship.”
“The prospect of running a capable LLM entirely on my iPhone without sending any data to a server is genuinely exciting for creative work with sensitive material. Drafting, editing, and ideation without a cloud subscription or privacy concerns — I'd pay for that, and here it's free.”