News · 2026-06-20

AI 'world models' have short-term memory — they forget what's off-screen

One of the most exciting ideas in AI right now is the world model — a system that learns how an environment behaves and can predict what happens next, the way you can guess that a dropped glass will shatter or that a ball rolling off a table will fall. World models matter because they're a path toward AI that can plan, imagine consequences, and act in the physical world rather than just chatting about it. But a broad new study argues that today's world models have a basic and revealing flaw: they can predict the next moment, but they don't actually remember the world.

The paper, Current World Models Lack a Persistent State Core, runs a large, systematic test — thousands of generated videos spanning more than twenty different models and several styles of control. The pattern it uncovers is consistent and a little unsettling. When an object or part of the scene leaves the frame and then comes back, the model doesn't continue the version of reality it had before. Instead it "resumes an abandoned state" — it improvises a fresh version of whatever wandered out of view.

The authors' own analogy is the right one: it's like a video game that regenerates a room the moment you turn your back. Walk away from a table you've set, turn around, and the cups have rearranged themselves. The world looks plausible at every instant, but it isn't continuous. There's no stable, enduring record of "how things are" — only a talented improviser filling in the next frame from whatever it can currently see.

Why does this happen? Most of these systems are extraordinarily good at short-term prediction. Given the last few seconds, they produce a convincing next few seconds. But that skill is local. They don't carry a durable, internal ledger of the whole environment — what the researchers call a persistent state core — that keeps evolving even for the parts nobody is watching. Out of sight is, quite literally, out of mind. Human cognition does the opposite: you maintain a rough mental map of your kitchen even with your eyes closed, and you'd be startled if the layout changed when you looked again. That sense of object permanence — the knowledge that things keep existing and keep behaving even when unobserved — is exactly what these models lack.
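
A toy sketch can make the distinction concrete. Nothing below comes from the paper: the one-dimensional scene, the class names PersistentModel and FrameOnlyModel, and the VELOCITY and PRIOR tables are all hypothetical, chosen only to contrast a model that advances a full scene state with one that remembers nothing beyond its last frame.

```python
# Toy illustration only; every name and number here is an assumption.
VELOCITY = {"ball": 0.5}  # hypothetical dynamics: units moved per tick
PRIOR = {"ball": 1.0}     # hypothetical "typical" layout used to improvise


def in_view(pos, view):
    lo, hi = view
    return lo <= pos <= hi


class PersistentModel:
    """Keeps one state for the whole scene and advances all of it."""

    def __init__(self, layout):
        self.state = dict(layout)

    def step(self, view):
        # Every object moves, whether or not the camera sees it.
        for name in self.state:
            self.state[name] += VELOCITY[name]
        return {n: p for n, p in self.state.items() if in_view(p, view)}


class FrameOnlyModel:
    """Remembers only what was in the last rendered frame."""

    def __init__(self, layout):
        self.frame = dict(layout)

    def step(self, view):
        moved = {n: p + VELOCITY[n] for n, p in self.frame.items()}
        visible = {n: p for n, p in moved.items() if in_view(p, view)}
        # Anything it has no record of is improvised from the prior layout.
        for name, pos in PRIOR.items():
            if name not in visible and in_view(pos, view):
                visible[name] = pos
        self.frame = visible  # off-screen objects are forgotten here
        return visible


def run(model, views):
    frame = {}
    for view in views:
        frame = model.step(view)
    return frame


# Watch the ball for 2 ticks, look away for 4, then look back.
VIEWS = [(0, 5)] * 2 + [(10, 15)] * 4 + [(0, 5)]
persistent_frame = run(PersistentModel({"ball": 1.0}), VIEWS)
frame_only_frame = run(FrameOnlyModel({"ball": 1.0}), VIEWS)
```

In this toy, the persistent model shows the ball where seven ticks of rolling would have carried it, while the frame-only model puts it back at its default spot, as if the time off-screen never happened.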

To make the problem measurable rather than anecdotal, the team built a diagnostic test suite that deliberately stresses these weak spots: moving a camera away from something and back, checking whether a scene stays coherent over time, and checking whether a target you return to is still the way you left it. It's essentially a memory exam for world models, and most of the models studied don't pass cleanly.
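
One way to picture the "still the way you left it" check is a simple revisit score. This is a minimal sketch, not the paper's metric: the function names mean_abs_diff and revisit_consistency are hypothetical, frames are assumed to be small grayscale grids with pixel values between 0 and 1, and the scene is assumed to be static while the camera is away.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized frames."""
    if len(frame_a) != len(frame_b) or any(
        len(ra) != len(rb) for ra, rb in zip(frame_a, frame_b)
    ):
        raise ValueError("frames must have the same shape")
    diffs = [
        abs(a - b)
        for row_a, row_b in zip(frame_a, frame_b)
        for a, b in zip(row_a, row_b)
    ]
    return sum(diffs) / len(diffs)


def revisit_consistency(frames, leave_idx, return_idx):
    """Score in [0, 1]; 1.0 means the revisited view matches the view
    seen just before the camera left (assumes a static scene)."""
    return 1.0 - mean_abs_diff(frames[leave_idx], frames[return_idx])


# Hypothetical 2x2 frames: before leaving, looking away, and on return.
before = [[0.0, 1.0], [1.0, 0.0]]
away = [[0.5, 0.5], [0.5, 0.5]]
stable_return = [[0.0, 1.0], [1.0, 0.0]]   # scene preserved
drifted_return = [[1.0, 1.0], [1.0, 0.0]]  # one pixel regenerated

stable_score = revisit_consistency([before, away, stable_return], 0, 2)
drifted_score = revisit_consistency([before, away, drifted_return], 0, 2)
```

A real benchmark would need something far more robust than raw pixel differences, since lighting, camera pose, and legitimate scene dynamics all change pixels without indicating a memory failure. The sketch only shows the shape of the question being asked.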

Why it matters: the entire promise of world models is that an AI could use one to plan — to mentally simulate a path through a warehouse, anticipate how a stack of objects will settle, or reason about a scene over minutes rather than moments. Every one of those tasks demands consistency over time. A planner built on a model that quietly rewrites the off-screen world will make confident plans grounded in a reality that keeps shifting underneath it. The flashy demos — gorgeous, physically plausible short clips — can hide this, because a few seconds rarely expose the memory gap. Stretch the horizon, or simply look away and back, and the cracks show.

The paper's prescription is a shift in design priorities: build models around a stable internal "physical state" that persists and evolves regardless of what the camera is pointed at, rather than chasing ever-prettier short clips. That's easier proposed than done. A genuinely persistent state has to track an enormous amount about a scene, keep it consistent as everything interacts, and do so without ballooning the computation — a hard engineering problem the paper diagnoses more than it solves.

The honest caveat: this is a critique with a measuring stick attached, not a finished cure. The new test suite is itself a proposal that the field has to adopt and pressure-test, and "add a persistent memory" can mean many different architectures, not all of which will pan out. But the contribution is clarifying. It moves the world-model conversation away from "look how realistic this clip is" toward the harder, more important question: does this system actually believe in a world that's still there when it stops looking? For now, mostly, it doesn't.

Primary source, verified: arXiv 2606.20545