Darwin-4B-David
4.5B merged model beats Gemma-4-31B on GPQA Diamond, no training needed
Expert verdict
Ship
3-1
The Panel's Take
Darwin-4B-David is a 4.5-billion-parameter model that scores 85.0% on GPQA Diamond, edging out Google's Gemma-4-31B (84.3%) at roughly 1/7th the parameter count. The kicker: it required no training whatsoever. It was built in 45 minutes on a single H100 using MRI-guided DARE-TIES model merging, a novel variant of the merge-and-trim technique.

The MRI-guided approach uses activation analysis to identify which parameters in each source model are most critical, then applies DARE-TIES merging only to those high-value weight regions. This avoids the catastrophic interference that usually degrades merged models. The result is a small model that inherits the strengths of multiple larger predecessors without any of the compute cost of fine-tuning.

For the AI community, this is a meaningful data point: model merging continues to close the gap with expensive training runs. Darwin-4B-David suggests that a thoughtful merge strategy can get competitive benchmark scores out of a model a fraction of the size, making capable AI more accessible on consumer hardware.
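To make the idea concrete, here is a minimal NumPy sketch of a DARE-TIES style merge restricted to "important" weights. It is an illustration under stated assumptions, not the Darwin-4B-David pipeline: the review does not describe how the MRI step scores parameters, so the `importance_mask` argument is a hypothetical stand-in for that analysis, and the function names and the `drop_prob` default are invented for this example. DARE randomly drops entries of each model's delta from the base and rescales the survivors; the TIES step keeps only deltas whose sign agrees with the dominant direction and averages them.

```python
import numpy as np

def dare(delta, drop_prob, rng):
    """DARE step: randomly drop delta entries, rescale survivors by 1 / (1 - drop_prob)."""
    keep = rng.random(delta.shape) >= drop_prob
    return np.where(keep, delta / (1.0 - drop_prob), 0.0)

def ties_merge(deltas):
    """TIES sign election: per parameter, average only the deltas whose sign
    matches the sign of the summed deltas."""
    stacked = np.stack(deltas)                      # shape: (n_models, ...)
    elected = np.sign(stacked.sum(axis=0))          # dominant direction per parameter
    agree = (np.sign(stacked) == elected) & (stacked != 0)
    count = agree.sum(axis=0)
    total = np.where(agree, stacked, 0.0).sum(axis=0)
    # Parameters with no agreeing delta keep a merged delta of zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

def guided_dare_ties(base, finetuned, importance_mask, drop_prob=0.5, seed=0):
    """Merge several same-shaped weight arrays into `base`.

    `importance_mask` is a HYPOTHETICAL boolean array standing in for the
    activation analysis the review calls "MRI-guided"; weights outside the
    mask are left at their base values.
    """
    rng = np.random.default_rng(seed)
    deltas = [dare(ft - base, drop_prob, rng) for ft in finetuned]
    merged_delta = ties_merge(deltas)
    return base + np.where(importance_mask, merged_delta, 0.0)
```

With `drop_prob=0.0` the random step becomes a no-op, which makes the sign election easy to check by hand: conflicting deltas cancel to zero, and masked-out positions stay at the base value.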
The reviews
“45 minutes on a single H100 to beat a 31B parameter model? That's an extraordinary efficiency ratio. MRI-guided merging is a technique I'll be watching closely. If this holds up across more benchmarks, it fundamentally changes how teams should think about building capable small models.”
“GPQA Diamond is one benchmark. One. Benchmark performance doesn't translate linearly to real-world task performance, especially for a merged model that hasn't been fine-tuned for instruction following or RLHF alignment. Impressive number, but I'd want to see this on coding, reasoning chains, and RAG tasks before getting excited.”
“Model merging is the dark horse of AI efficiency research. If MRI-guided DARE-TIES merging can reliably produce results like this, it suggests we're nowhere near the ceiling for extracting value from existing open-weight models. The future may involve less training and more intelligent composition.”
“A capable model in the 4-5B range that can run on a MacBook M-series is exactly what solo creators need for on-device inference. If Darwin-4B-David's performance holds on creative tasks, it's a genuine local creative AI for people without cloud budgets.”