News · 2026-06-19

Turn around, and the world disappears

A growing class of AI doesn't just generate a video clip — it's meant to hold a model of the world in its head: a place with objects that have positions and keep existing whether or not the camera is pointed at them. These "world models" are a big deal because they're the imagined sandbox a robot could plan inside, or the engine behind a game that builds itself as you walk through it. DeepMind's Genie 2, for instance, can turn a single still image into a little 3D world you can actually walk around in. For any of that to be useful, the world has to stay put when you look away.

A new benchmark set out to check whether it does, with a test anyone can picture. Show the model a scene, pan the camera away for a moment, then pan back. A cat that was mid-leap toward the bed should, by the time you return, be on the bed, or at least somewhere that makes sense given that a second has passed. Instead, again and again, the model snaps things back to how it last saw them. The cat is still frozen mid-leap, exactly where it was when the camera turned away. A door someone had pushed open is closed again. A stack of blocks you knocked over is neatly restacked. The world didn't keep running while you weren't watching; it quietly reset to the last frame it remembered.
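For readers who think in code, the look-away test can be caricatured in a few lines. This is only a toy sketch under loud assumptions: the one-dimensional positions, the constant speed, and every function name here are invented for illustration and are not the benchmark's actual code, metric, or data.

```python
# Toy version of the look-away test. All names and numbers are
# illustrative assumptions, not taken from the benchmark itself.

def true_position(start: float, velocity: float, elapsed: float) -> float:
    """Ground truth: the object keeps moving while the camera is away."""
    return start + velocity * elapsed


def snap_back_prediction(last_seen: float, velocity: float, elapsed: float) -> float:
    """The failure mode: redraw the object exactly where it was last seen."""
    return last_seen


def look_away_error(predict, last_seen: float, velocity: float, elapsed: float) -> float:
    """Gap between the predicted and true position after panning back."""
    truth = true_position(last_seen, velocity, elapsed)
    return abs(predict(last_seen, velocity, elapsed) - truth)


# Cat last seen at 0.0 m, moving 2.0 m/s toward the bed; camera away for 1.0 s.
reset_error = look_away_error(snap_back_prediction, 0.0, 2.0, 1.0)
ideal_error = look_away_error(true_position, 0.0, 2.0, 1.0)
print(reset_error, ideal_error)  # prints: 2.0 0.0
```

A predictor that resets to the last remembered frame is off by the full distance the cat should have covered, while one that keeps the world running is off by nothing. A real benchmark would have to score video frames rather than a single number, which is much harder.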

The most surprising result is which models do this worst. You'd expect bigger, more capable systems to keep better track. They don't — on this particular skill, scaling up tends to make the forgetting worse, not better. That's a strong clue the problem isn't "not enough horsepower." It's structural: these systems are superb at painting whatever is in frame right now, and have no real place to store the parts that have scrolled off-screen. They're less like a mind holding a scene in memory and more like an extraordinarily talented improviser who only knows what's directly in front of them — ask them to remember the corner they just turned away from and there's simply nowhere it was written down.

To make the gap concrete, picture a kitchen robot. A cup rolls behind the toaster. A person reaches in front of the camera. When the view clears, a model with no memory doesn't think "the cup is still behind the toaster" — it re-paints the scene from scratch, and the cup may be gone, or back where it started, or somewhere new entirely. You cannot plan a reliable grab against a world that rewrites itself every time something blocks the view. The same goes for a game you can turn your back on: walk down a corridor, turn around, and the room you just left has silently rearranged its furniture.

This connects to a quietly important theme running through the week's research. A separate paper on giving robots real spatial tools lands on the same missing ingredient from a different angle — persistent memory of where things are across multiple glances — while another argues robots might skip the imagined video entirely and plan from a single still frame, sidestepping the forgetting problem rather than solving it. Three groups, three directions, all circling the same gap. When that happens, it usually means a real weakness has been found rather than a one-off complaint.

The researchers make the case that fixing this needs a genuinely different ingredient — something that acts as a persistent "state of the world," a memory the model writes to and reads back, kept separate from the picture it happens to be drawing at any moment. Today's models fold "what's true about the scene" and "what pixels go on screen right now" into one step, and the truth gets overwritten every time the picture changes. Splitting those apart — a lasting ledger of the world plus a renderer that draws from it — is the direction several teams are now pointing.
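The ledger-plus-renderer split can be sketched in miniature. What follows is a hypothetical illustration, not the researchers' design: the `WorldState` class, the `render` function, and the one-dimensional "camera view" are all assumptions made up for this example.

```python
from dataclasses import dataclass, field


@dataclass
class WorldState:
    """Hypothetical persistent ledger: object name -> (position, velocity)."""
    objects: dict[str, tuple[float, float]] = field(default_factory=dict)

    def step(self, dt: float) -> None:
        # Advance every object, whether or not any camera can see it.
        self.objects = {
            name: (pos + vel * dt, vel)
            for name, (pos, vel) in self.objects.items()
        }


def render(state: WorldState, view_min: float, view_max: float) -> dict[str, float]:
    """Read-only renderer: reports objects inside the view, never edits the ledger."""
    return {
        name: pos
        for name, (pos, _vel) in state.objects.items()
        if view_min <= pos <= view_max
    }


world = WorldState({"cup": (1.0, 1.0)})
before = render(world, 0.0, 2.0)      # cup is in view at position 1.0
for _ in range(3):
    world.step(1.0)                   # the world keeps running...
    render(world, 10.0, 12.0)         # ...while the camera looks elsewhere
after = render(world, 0.0, 5.0)       # pan back with a wider view
print(before, after)  # prints: {'cup': 1.0} {'cup': 4.0}
```

Because drawing a frame only reads from the ledger, looking away cannot erase the cup: when the camera returns, it is where three seconds of rolling would have put it. The hard research problem, which this toy skips entirely, is learning such a ledger from pixels instead of being handed one.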

Why it matters comes down to what we want these systems for. A model that forgets the room the instant you look elsewhere can still make a gorgeous six-second clip — and that's genuinely useful for film and art. But it can't be the dependable imagination inside a robot deciding where to reach, or a game world you can explore and trust to stay consistent. This benchmark turns a vague intuition — "these things don't really understand space" — into a specific, measurable failure that the next wave of research now has to beat.

The usual caveat: the work is days old and measures one particular kind of forgetting, so it's a sharp diagnosis rather than the final word, and a system that fails this test isn't worthless at everything else. But it's the kind of clean, almost playful experiment — turn around and see if the world is still there — that tends to stick, because anyone can understand exactly what's being asked, and exactly how today's models come up short.

Primary source, verified: arXiv 2606.20545