The M×N Tool Calling Problem — Why Open-Source AI Can't Agree on How to Call a Function
A new analysis formalizes the M×N tool calling problem: M applications (vLLM, SGLang, TensorRT-LLM) each write custom parsers for N model families, each family using an incompatible wire format for the same operations, creating a multiplicative M×N maintenance burden with no shared contract.
A blog post published this week has struck a nerve on Hacker News, formalizing a problem that anyone building multi-model AI applications has run into but lacked clean language for: every model family encodes tool calls differently, and every framework that supports multiple models must independently maintain parsers for each format.
The same operation — calling `search(query="GPU")` — uses three distinct wire formats across GPT-compatible, DeepSeek, and GLM5 models: different token vocabularies, different boundary markers, different argument serialization schemes. The result is M frameworks × N model formats — the same format knowledge reimplemented over and over, each copy reverse-engineered independently by vLLM, SGLang, TensorRT-LLM, LangChain, and every other framework that wants to work with multiple providers.
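To make the divergence concrete, here is a minimal sketch of what three incompatible wire formats for the same call can look like. The marker tokens (`<tool_begin>`, `<tc>`, etc.) are invented placeholders, not real model vocabularies — the point is that each format needs its own bespoke parser even though all three encode the identical call:

```python
import json
import re

# Three schematic renderings of search(query="GPU").
# These formats are illustrative assumptions, not verbatim model output.
gpt_style = '{"name": "search", "arguments": {"query": "GPU"}}'
dsk_style = '<tool_begin>search<sep>{"query": "GPU"}<tool_end>'
glm_style = '<tc>search\n<k>query</k><v>GPU</v>\n</tc>'

def parse_gpt(text):
    # JSON object with "name" and "arguments" fields
    obj = json.loads(text)
    return obj["name"], obj["arguments"]

def parse_dsk(text):
    # Special boundary tokens wrapping a name and a JSON argument blob
    m = re.fullmatch(r"<tool_begin>(\w+)<sep>(\{.*\})<tool_end>", text, re.S)
    return m.group(1), json.loads(m.group(2))

def parse_glm(text):
    # XML-ish tags with per-argument key/value markup
    m = re.fullmatch(r"<tc>(\w+)\n<k>(\w+)</k><v>([^<]*)</v>\n?</tc>", text, re.S)
    return m.group(1), {m.group(2): m.group(3)}
```

All three parsers recover the same structured call, `("search", {"query": "GPU"})` — three implementations of one piece of knowledge, which is the M×N problem in miniature.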
The analysis distinguishes two separate systems that each embed this format knowledge: grammar engines (which constrain generation to produce valid tool calls) and output parsers (which extract structured results from completions). Both must be updated independently when a new model arrives, and both embed format knowledge that should be a shared, declarative specification rather than repeated engineering work.
The proposed solution mirrors how the ecosystem standardized chat templates: a declarative specification format that separates format knowledge from implementation code. When a new model ships, teams update the spec file rather than rewriting parsers across the stack. Grammar engines and parsers consume the spec rather than hardcoding format assumptions.
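A minimal sketch of the spec-driven idea, with invented format patterns standing in for real model families: the format knowledge lives in a data table, and a single generic parser consumes it. Supporting a new model then means adding one spec entry, not writing a new parser.

```python
import json
import re

# Hypothetical declarative specs: each entry captures one model family's
# wire format as data, not as bespoke parser code. The patterns are
# placeholders, not real model formats.
TOOL_CALL_SPECS = {
    "family_a": r"<tool_call>(?P<name>\w+)\s*(?P<args>\{.*?\})</tool_call>",
    "family_b": r"\[FN\](?P<name>\w+)\|(?P<args>\{.*?\})\[/FN\]",
}

def parse_tool_call(family: str, text: str):
    """One generic parser for every family; only the spec table varies."""
    m = re.search(TOOL_CALL_SPECS[family], text, re.S)
    if not m:
        return None
    return m.group("name"), json.loads(m.group("args"))
```

This mirrors how chat templates work today: `apply_chat_template` in Hugging Face Transformers renders any model's conversation format from a Jinja template shipped with the model, rather than from per-model rendering code.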
The post arrives at a moment when MCP (Model Context Protocol) is establishing one layer of tool calling standardization — how tools are defined and discovered — without addressing the wire format layer of how different models actually encode tool invocations in their outputs. The gap between these layers is exactly where the M×N problem lives, and filling it is likely to be the next major standardization effort in the AI infrastructure stack.
Panel Takes
“This is the most articulate description of a problem I've been working around for months. Every model family — GPT, DeepSeek, GLM5, Llama — encodes tool calls differently, so every framework that wants to support multiple models has to maintain its own set of N parsers, one per family. The proposed solution (declarative spec files, similar to chat templates) is the right abstraction: update the spec when a new model arrives, not the code. Someone needs to build this and submit it to a standards body.”
“The M×N problem is real, but it's also a symptom of a moving target. Model vendors are actively experimenting with tool calling formats because we don't yet know what works best. Standardizing too early locks in the wrong abstraction — see: the history of XML web services and early REST API conventions. The right answer might simply be 'model providers converge on OpenAI's format, everyone else adapts,' which is already happening. The community blog post correctly identifies the problem but may be solving something the market is already resolving.”
“Tool calling is the primary interface through which LLMs act on the world, and incompatible wire formats are the equivalent of browsers with incompatible HTML in 1999. The W3C moment for AI tool calling is coming — whether driven by a standards body, a dominant provider, or an open-source consortium. This post will be cited in that effort. The MCP protocol is an attempt at one layer of this, but the format standardization layer remains unsolved.”