News · 2026-06-26

Why teaching AI agents to use tools keeps blowing up in training

Training an AI to be a good agent, one that calls tools, reads the results, and chains many steps toward a goal, is supposed to get better with practice. In practice, this kind of training has a nasty habit of suddenly falling apart mid-run, with performance cratering for no obvious reason. A new paper, "Why Multi-Step Tool-Use Reinforcement Learning Collapses and How Supervisory Signals Fix It," diagnoses the cause and offers a practical fix.

The background. After a model is first trained, it is often polished with reinforcement learning: it tries things, gets rewarded for good outcomes, and adjusts. For a tool-using agent, a single task might mean a dozen actions in sequence (call a search tool, read the result, call a calculator, format an answer), with reward arriving only at the end. That long, multi-step structure is exactly what makes the training fragile, because a small problem early in the chain can cascade through every later step.

What the authors found is precise and a little surprising. The collapse is not the model forgetting how to use tools; the underlying skill stays intact. Instead, the training causes unexpected probability spikes in a few specific control tokens, the small structural markers that tell the system things like "start a tool call here" or "stop." When the probability of one of these control tokens balloons out of proportion, it scrambles the agent's structured execution, the way a single stuck key on a keyboard can wreck an otherwise fine sentence. The capability is fine; the scaffolding that organizes it breaks.
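To make the diagnosis concrete, here is a minimal sketch of how one might watch for this kind of spike during a run. It is an illustration, not the authors' tooling: the function names, the toy four-token vocabulary, the baseline values, and the 5x alert ratio are all assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def control_token_spikes(logits_per_position, control_ids, baseline, ratio=5.0):
    """Flag control tokens whose mean probability exceeds ratio * baseline.

    logits_per_position: one list of vocabulary logits per generated position.
    control_ids: token ids treated as control tokens (hypothetical ids here).
    baseline: {token_id: mean probability measured before RL started}.
    ratio: assumed alert threshold, not a value from the paper.
    """
    n = len(logits_per_position)
    mean_prob = {tid: 0.0 for tid in control_ids}
    for logits in logits_per_position:
        probs = softmax(logits)
        for tid in control_ids:
            mean_prob[tid] += probs[tid] / n
    return {tid: p for tid, p in mean_prob.items() if p > ratio * baseline[tid]}


# Toy vocabulary: ids 0 and 1 are ordinary tokens, 2 is "start tool call",
# 3 is "stop". All values below are made up for illustration.
CONTROL_IDS = [2, 3]
BASELINE = {2: 0.06, 3: 0.06}

healthy = [[2.0, 2.0, 0.0, 0.0]] * 3    # control tokens stay rare
collapsed = [[0.0, 0.0, 0.0, 6.0]] * 3  # the "stop" token dominates

healthy_flags = control_token_spikes(healthy, CONTROL_IDS, BASELINE)
collapsed_flags = control_token_spikes(collapsed, CONTROL_IDS, BASELINE)
```

In this toy setup the healthy batch raises no flags, while the collapsed batch flags only the "stop" token. A real monitor would read logits from the model being trained and would need its own choice of baseline and threshold.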

An analogy. Picture a skilled chef who knows every recipe perfectly. Now imagine that during a stressful service, they develop a tic where they compulsively shout "next, next, next" out of turn. The cooking knowledge is untouched, but the kitchen's choreography falls apart because the control signals, the calls that coordinate the line, have gone haywire. The dishes come out wrong not because the chef forgot how to cook but because the timing commands got corrupted. That is what runaway control-token probabilities do to an agent.

The fix is to stop training purely by trial and error and weave in supervision. The researchers tested several kinds of guidance, including showing the model correct examples, giving it hints, and even deliberately showing it bad examples to learn from, and found that interleaving ordinary supervised learning with the reinforcement learning keeps the control tokens in check and the training stable. In short, pure self-directed practice destabilizes; mixing in a teacher who occasionally shows the right way steadies the whole process.
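The interleaving idea can be sketched as a simple training schedule. This is a hedged illustration only: the article does not say how often the authors mix in supervised steps, so the fixed "one supervised step every N steps" rule, the function names, and the placeholder step callbacks are all assumptions.

```python
from typing import Callable, List


def interleaved_schedule(total_steps: int, sft_every: int) -> List[str]:
    """Label each training step "rl" or "sft".

    Every sft_every-th step is a supervised step; the rest are RL steps.
    The fixed cadence is an assumption, not the paper's recipe.
    """
    if sft_every < 1:
        raise ValueError("sft_every must be at least 1")
    return [
        "sft" if (step + 1) % sft_every == 0 else "rl"
        for step in range(total_steps)
    ]


def train(
    total_steps: int,
    sft_every: int,
    rl_step: Callable[[], None],
    sft_step: Callable[[], None],
) -> List[str]:
    """Run the schedule, calling the matching update function at each step."""
    log = []
    for kind in interleaved_schedule(total_steps, sft_every):
        if kind == "sft":
            sft_step()  # placeholder: one update on supervised examples
        else:
            rl_step()   # placeholder: one reward-driven policy update
        log.append(kind)
    return log


# Usage with stand-in callbacks that only count how often they are called.
counts = {"rl": 0, "sft": 0}


def fake_rl_step() -> None:
    counts["rl"] += 1


def fake_sft_step() -> None:
    counts["sft"] += 1


history = train(6, 3, fake_rl_step, fake_sft_step)
```

With six steps and a supervised step every third step, the run performs four RL updates and two supervised updates. In a real system the two callbacks would wrap the actual RL and supervised update routines, and the cadence would be a tuning choice.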

Why it matters: this is a load-bearing problem for the entire agent boom. Every company racing to ship agents that book travel, write and run code, or manage workflows has to train them on exactly these long, multi-step tool-use tasks, and instability in that training is a hidden tax that wastes expensive compute and produces unreliable agents. A clean diagnosis (the problem is control tokens, not lost capability) plus a concrete remedy (blend in supervision) is the kind of unglamorous result that quietly makes the next generation of agents more dependable. It pairs naturally with the week's other agent-reliability work, including research on rewarding agents without a clear referee.

The honest caveat: the fix is not free. The authors note that interleaving supervised training with reinforcement learning can hurt performance on out-of-distribution tasks, situations that look different from the training examples. That is a real trade-off: leaning on supervised examples stabilizes training but can also tether the agent to the patterns it was shown, making it a bit less adaptable when the world throws it something genuinely new. This is also a single study on specific setups, and like much agent-training research it will need replication across more models and tasks before the recipe is treated as settled. Still, naming the precise failure mode is a meaningful step, because you cannot fix what you cannot see.

Primary source, verified: arXiv 2606.26027