ICLR 2026 Paper: Making LLMs Reason Harder Makes Them Hallucinate Tool Calls More

A new paper from ICLR 2026 found that reinforcement learning used to enhance LLM reasoning simultaneously increases tool hallucination — the models invent non-existent API calls more often. Task accuracy and hallucination rates rise together, not in opposition.

A paper published at ICLR 2026, "The Reasoning Trap: How Enhancing LLM Reasoning Amplifies Tool Hallucination," has surfaced a troubling, counterintuitive result that is drawing attention across the AI safety and agent reliability communities.

The study's central finding: when language models undergo reinforcement learning to improve their reasoning capabilities — the same training that makes models better at complex multi-step problems — they simultaneously become more likely to invent non-existent tool calls. The benchmark used, **SimpleToolHalluBench**, tests whether agents refuse impossible tasks (correct behavior) or fabricate solutions using tools that don't exist (hallucination). The results showed task accuracy and hallucination rates increasing in tandem.
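To make the measured quantity concrete, here is a minimal sketch of how a tool-hallucination rate could be computed. It is not SimpleToolHalluBench's actual scoring code. It assumes each episode is a list of tool calls represented as dicts with a `name` key, and the function names and tool names (`get_weather`, `book_flight`) are hypothetical.

```python
def hallucinated_calls(calls, available_tools):
    """Return the calls whose tool name is not in the set of real tools."""
    return [call for call in calls if call["name"] not in available_tools]


def hallucination_rate(episodes, available_tools):
    """Fraction of episodes containing at least one call to a non-existent tool."""
    if not episodes:
        return 0.0
    flagged = sum(
        1 for calls in episodes if hallucinated_calls(calls, available_tools)
    )
    return flagged / len(episodes)


# Hypothetical data: three episodes, one of which invents a `book_flight` tool.
available = {"get_weather", "search_docs"}
episodes = [
    [{"name": "get_weather", "arguments": {"city": "Oslo"}}],
    [{"name": "book_flight", "arguments": {}}],  # tool does not exist
    [],  # the agent refused, which is the correct behavior for an impossible task
]
rate = hallucination_rate(episodes, available)
```

An empty call list counts as a refusal here, which is a simplification: a real harness would also need to check that the model's text response does not claim to have completed the task.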

The mechanism identified by the researchers is specific: reasoning enhancement training "disproportionately collapses tool-reliability-related representations" in late network layers — exactly where safeguards against false tool invocations are supposed to operate. In other words, making a model reason better actively undermines its ability to recognize when it should refuse.

This matters given the scale of agent deployment: 96% of enterprises now run AI agents in production, and 47% of enterprise AI users have based major decisions on hallucinated content, according to Deloitte research. In agentic pipelines where one model's tool call feeds into the next agent's context, a single fabricated API call can cascade into compound errors across an entire workflow.

The researchers recommend "no-tool" evaluations before deployment — testing specifically whether models hallucinate non-existent capabilities — and human approval checkpoints for high-stakes tool calls, particularly in payroll, HR, and financial workflows. The paper arrives as companies are rapidly expanding the autonomy of their agent deployments, making its timing particularly pointed.
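Both recommendations can be sketched in a few lines. The code below is an illustration under stated assumptions, not the paper's procedure: `agent(task, tools)` is a hypothetical callable that returns the list of tool calls the model wants to make, `HIGH_STAKES` is an invented set of tool names, and `approve` stands in for whatever human review step a team actually uses.

```python
from typing import Callable

# Hypothetical names for tools that should never run without human sign-off.
HIGH_STAKES = {"run_payroll", "update_employee_record", "transfer_funds"}


def no_tool_eval(agent: Callable[[str, list], list], tasks: list) -> float:
    """Run each task with an empty tool list and return the fraction refused.

    With no tools available, any tool call the agent emits is fabricated,
    so an empty result is treated as a correct refusal.
    """
    if not tasks:
        return 1.0
    refused = sum(1 for task in tasks if not agent(task, []))
    return refused / len(tasks)


def dispatch(call: dict,
             execute: Callable[[dict], object],
             approve: Callable[[dict], bool]) -> dict:
    """Execute a tool call, asking for approval first when it is high-stakes."""
    if call["name"] in HIGH_STAKES and not approve(call):
        return {"status": "rejected", "call": call}
    return {"status": "executed", "result": execute(call)}


# Stub agent for demonstration: it invents a call whenever payroll is mentioned.
def stub_agent(task: str, tools: list) -> list:
    if "payroll" in task:
        return [{"name": "run_payroll", "arguments": {}}]
    return []


refusal_rate = no_tool_eval(stub_agent, ["run payroll for May", "summarize this memo"])
outcome = dispatch({"name": "run_payroll", "arguments": {}},
                   execute=lambda call: "done",
                   approve=lambda call: False)
```

A deployment gate could then fail the release when `refusal_rate` drops below a threshold the team chooses; the threshold itself is a policy decision the source does not specify.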

Panel Takes

The Builder

Developer Perspective

“This is why I test every new model version on my exact tool schema before upgrading in production. The no-tool evaluation recommendation should honestly be a standard gate in every CI/CD pipeline that deploys agents — surprised it isn't already common practice.”

The Skeptic

Reality Check

“The paper is important but the extrapolation to HR and payroll risks in the coverage feels like motivated reasoning for a specific industry vertical. The core finding is real — RL training creates an accuracy/hallucination tradeoff — but the disaster scenarios need more empirical grounding to go with the alarm.”

The Futurist

Big Picture

“The adversarial relationship between reasoning capability and hallucination resistance is one of the most important unsolved problems in AI right now. Tools like Plurai and the broader eval infrastructure category exist precisely because this problem is getting worse as models get better — the market is responding in real time.”
