AI Counted Carbs 27,000 Times — The Results Should Worry Every Med-AI Developer

A researcher asked multiple AI models to estimate carbohydrate content in identical food images 27,000 times. The variance between responses — from the same model, on the same image — was startling. The study is making waves on Hacker News as a blunt reminder that consistency is AI's unresolved problem in medical contexts.

Original source

A diabetes researcher ran one of the most methodically simple but revealing AI tests in recent memory: ask multiple large language models to count carbs from food images. Then ask them again. And again. 27,000 times in total, across multiple models, meals, and prompt variations.

The findings, published on Diabettech and currently trending on Hacker News with 206 points, expose a fundamental challenge: **the same model, given the same image and the same prompt, can give wildly different carbohydrate estimates across separate queries**. For a diabetic deciding insulin dosage, that variance isn't an interesting edge case — it's a safety hazard.

The study didn't single out any one model as especially bad. The inconsistency appeared across major providers. Some models had higher variance on complex plates; others struggled with packaged foods. No model achieved the kind of deterministic repeatability that a clinical context demands.

What makes this research particularly timely is the accelerating push to deploy AI in healthcare workflows. Apps that suggest insulin doses, calorie tracking tools powered by vision models, dietary AI embedded in continuous glucose monitors — all are betting on consistency that this data suggests isn't there yet.

The most-upvoted HN comment captures the community mood: *"This is why 'AI in medicine' papers that report accuracy metrics but not variance metrics are incomplete."* The study is a call for medical AI benchmarks to include consistency as a first-class measurement, not just accuracy averaged over a test set.

Panel Takes

The Builder

Developer Perspective

“Temperature settings and sampling parameters are tunable, but the underlying variance in vision-language models on nutritional estimation is a deeper problem than prompt engineering can fix. This needs to be a model card requirement.”

The Skeptic

Reality Check

“This study will be ignored by the AI-in-healthcare hype cycle for another 18 months, at which point there will be a high-profile incident, and everyone will act surprised. The research exists. The incentives to deploy fast exist harder.”

The Futurist

Big Picture

“Consistency metrics will become mandatory in medical AI regulation within two years. This study is the kind of pre-regulatory evidence that shapes FDA guidance. The field is being warned — the question is whether it listens.”

Panel Takes

Bookmarks