News · 2026-06-19

Giving an AI real spatial tools instead of letting it guess

Today's vision AIs are dazzling at describing a picture — they'll tell you it's a sunny kitchen with a mug on the counter and a laptop beside it. Ask them something precise about space, though, and they get shaky: How far is the mug from the laptop? Is it to the left or right from where you're standing? Would it fit on the shelf above? On questions like these, the models tend to do something very human and very unreliable — they eyeball it and guess. A new system takes a different tack: instead of asking one model to intuit 3D geometry in its head, it lets the model reach for the right instrument.

The setup treats the AI less like an all-knowing oracle and more like a smart project manager. When a spatial question comes up, it doesn't try to feel out the answer; it calls a specialized tool for the job — one that precisely locates objects in the flat image, another that reasons about actual 3D geometry and distance, another that knows general facts about how space and objects work — and then combines what those tools report. Each tool does one narrow thing well, and the model's job is to pick the right one and assemble the pieces, rather than to be secretly good at everything at once.
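To make the "project manager" idea concrete, here is a minimal Python sketch of tool dispatch. It is an illustration under stated assumptions, not the paper's implementation: the tool names (`locate_2d`, `position_3d`), the toy scene values, and the lookup-table `plan` step are all hypothetical. In the system described, the model itself decides which tool to call.

```python
from math import dist

# Toy scene: hypothetical values standing in for what real perception tools would report.
SCENE = {
    "mug":    {"pixel": (420, 310), "xyz_m": (0.30, 0.90, 1.20)},
    "laptop": {"pixel": (640, 330), "xyz_m": (0.70, 0.90, 1.50)},
}

def locate_2d(label):
    """2D tool (hypothetical): pixel centre of the object in the flat image."""
    return SCENE[label]["pixel"]

def position_3d(label):
    """3D tool (hypothetical): estimated position in metres, camera frame."""
    return SCENE[label]["xyz_m"]

TOOLS = {"locate_2d": locate_2d, "position_3d": position_3d}

def plan(question_kind):
    """Orchestrator step: pick the tool a question needs.
    A fixed lookup here; an assumption standing in for the model's own choice."""
    return {"distance": "position_3d", "left_or_right": "locate_2d"}[question_kind]

def answer(question_kind, a, b):
    """Call the chosen tool for both objects, then combine the reports."""
    tool = TOOLS[plan(question_kind)]
    pa, pb = tool(a), tool(b)
    if question_kind == "distance":
        return round(dist(pa, pb), 2)          # straight-line distance in metres
    return "left" if pa[0] < pb[0] else "right"  # compare image x-coordinates

print(answer("distance", "mug", "laptop"))       # 0.5
print(answer("left_or_right", "mug", "laptop"))  # left
```

The point of the design is that the distance comes from arithmetic on measured coordinates rather than from the model's impression of the image. The orchestrator only has to route the question correctly.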

Crucially, it also keeps a memory across multiple views. Glance at the room from one angle, then another, and it stitches those glimpses into a single consistent picture rather than treating each frame as a fresh, amnesiac snapshot. That persistent memory is exactly the ingredient a separate result this week found missing in AI world models, which forget whatever drifts off-screen. Seeing two different teams converge on "you need a lasting record of where things are, not just a pretty picture of the current frame" is a good sign that the field has found a real, shared gap.
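One simple way to picture such a memory is a store of object positions in a single shared "world" frame, updated from each camera view. The Python sketch below assumes a flat 2D floor plan and a known camera position and heading for every view. The `SpatialMemory` class and its methods are hypothetical names for illustration; the article does not describe how the paper actually builds its memory.

```python
import math

class SpatialMemory:
    """Hypothetical sketch: remembers where objects are in one shared world frame."""

    def __init__(self):
        self.objects = {}  # label -> (x, y) in metres, world frame

    def update(self, cam_xy, cam_yaw, observations):
        """Merge one view into memory.

        cam_xy: camera position in the world frame (assumed known).
        cam_yaw: camera heading in radians (assumed known).
        observations: label -> (x, y) position in the camera's own frame.
        """
        c, s = math.cos(cam_yaw), math.sin(cam_yaw)
        for label, (x, y) in observations.items():
            # Rotate into world orientation, then shift by the camera position.
            wx = cam_xy[0] + c * x - s * y
            wy = cam_xy[1] + s * x + c * y
            self.objects[label] = (round(wx, 3), round(wy, 3))

memory = SpatialMemory()
# View 1: camera at the origin, facing along +x, sees only the mug.
memory.update((0.0, 0.0), 0.0, {"mug": (1.0, 2.0)})
# View 2: camera has moved and turned 90 degrees; the mug is now out of view.
memory.update((1.0, 0.0), math.pi / 2, {"laptop": (2.0, -1.0)})

print(memory.objects)  # {'mug': (1.0, 2.0), 'laptop': (2.0, 2.0)}
```

After the second view the mug is no longer visible, but its entry is still there, and both objects sit in the same coordinate frame so they can be compared directly.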

The striking result is that the open, freely available version of this system reportedly holds its own on these spatial tasks against the big closed, commercial models, which usually win on raw scale. That's a recurring theme worth noting: when a problem has a clear sub-structure, "let the model orchestrate the right specialized tools" often beats "make one giant model bigger and hope spatial sense emerges." It mirrors how a person actually answers a hard distance question: not by staring harder, but by grabbing a tape measure. We don't expect a brilliant novelist to also be a surveyor; we expect them to know when to call one.

To picture why this matters, think about the machines we want to act in the physical world. A pair of AR glasses telling you "the exit is twelve feet to your right, behind the pillar" has to be right about that, not vibes-right. A home robot reaching for a dropped pill bottle has to know exactly where it is in three dimensions, and remember it's still there after someone walks past and blocks the view. These are the situations where a confident spatial guess isn't just wrong; it's useless or dangerous. It's the same demand for spatial precision that makes a task like a robot seating a graphics card in a motherboard so hard: millimetres matter, and "roughly there" fails.

The deeper point is about how AI gets good at the physical world at all. There are two philosophies in tension: make one enormous model and hope competence emerges from sheer scale, or build a capable orchestrator that knows which specialized tools to call and how to combine them. This paper is a strong data point for the second camp, at least for spatial reasoning — a domain that's about as structured and rule-governed as the real world gets, which is exactly where dedicated tools should shine.

The honest limits: this is days-old research, measured on a specific battery of spatial tasks, and "matches the closed models" is a claim made against particular benchmarks rather than the messy real world. Wiring up a pile of specialized tools also adds complexity and new ways to fail compared to one self-contained model — every tool is another thing that can break or be called at the wrong moment. But the direction is compelling, because it lines up with where the field keeps landing: for problems that have real structure — and 3D space is about as structured as it gets — teaching an AI to use the right tool tends to beat asking it to wing the whole thing in its head.

Primary source, verified: arXiv 2606.20515