What are world models?

A chess-playing AI doesn't need to understand the physical world — it just needs to know the rules and how to search through possible moves. But a robot in a kitchen needs something richer. It needs to know that water pours downward, that a hot pan stays hot after the burner turns off, that closing a door makes the room behind it inaccessible. It needs a model of how the world works — not just a snapshot of what it sees, but a theory of what will happen next.

This is what researchers mean by a world model: an AI system's internal representation of the dynamics of an environment. A world model can answer not just "what is true right now?" but "what will be true after I take this action?" and "what would have happened if I had done something different?" These are the questions that planning requires. Without a way to answer them, an AI can only react to what it directly perceives rather than reason about futures it hasn't experienced yet.

The concept has roots in cognitive science, control theory, and AI research going back decades. In early AI, world models were hand-crafted rule systems: explicit databases of facts and rules about how objects behave. In reinforcement learning, the same role is played by a "transition function" or "dynamics model": a function, either specified in advance or learned from experience, that predicts the next state of the environment given the current state and an action. The key property in both cases is the same: the model captures dynamics, not just appearance.
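
As a minimal sketch of what a dynamics model looks like in code, the toy example below maps a state and an action to the next state. Everything in it is an illustrative assumption: the State fields, the transition and rollout names, and the hand-written physics. In a learned world model the body of transition would be a fitted function rather than a formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    position: float
    velocity: float


def transition(state: State, action: float, dt: float = 0.1) -> State:
    """Toy dynamics model (hypothetical): the action is an acceleration
    applied for dt seconds. A learned model would replace this formula."""
    velocity = state.velocity + action * dt
    position = state.position + velocity * dt
    return State(position, velocity)


def rollout(state: State, actions: list[float], dt: float = 0.1) -> list[State]:
    """Predict the sequence of states that follows a sequence of actions."""
    states = [state]
    for action in actions:
        state = transition(state, action, dt)
        states.append(state)
    return states


start = State(position=0.0, velocity=0.0)
what_will_happen = rollout(start, [1.0, 1.0], dt=1.0)
what_would_have_happened = rollout(start, [0.0, 0.0], dt=1.0)
```

Calling rollout twice from the same start state with different action lists is the code-level version of asking "what would have happened if I had done something different?"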

Planning with world models. The most compelling use of a world model is planning: before taking any action in the real world, simulate many possible futures inside the model and choose the action that leads to the best outcome. This "planning in imagination" can be far more sample-efficient than purely trial-and-error learning, provided the model is accurate enough, because you can evaluate thousands of hypothetical action sequences without the time, cost, or risk of actually trying them all. The Dreamer series of agents, developed by Danijar Hafner and collaborators at Google, DeepMind, and the University of Toronto, demonstrated this compellingly: by learning a compact world model from visual observations, an agent could learn behaviors from trajectories imagined inside the model and match the performance of methods that required far more real environment interactions. DreamerV3 (2023) extended this to a diverse set of environments, from video games to robotic control to 3D navigation, using the same algorithm with a single fixed set of hyperparameters.

Video-based world models. The most discussed form of world models in 2025-2026 is video generation. The idea is compelling: video contains enormous information about how things move, interact, and change over time. A model trained on enough video should, in principle, learn the physics of the world implicitly — that balls roll downhill, that liquids flow, that people move in coordinated ways. Several major AI labs have positioned video world models as central to their plans for building physical AI.

In practice, today's video generation models are better described as "tracking-shot simulators" than world models. They excel at rendering the next frame conditioned on recent frames — generating what the camera currently sees in a way that looks physically consistent. What they struggle with is tracking what happens to parts of the scene the camera isn't showing.

A benchmark called WRBench (2026) makes this gap concrete. It shows models a scene, moves the camera away from part of the action, then moves it back, and checks whether the model renders what should have happened in the meantime. A cat jumping toward a bed while the camera is pointed at a window should have landed by the time the camera returns. Current models mostly re-render the cat in its starting position. Scaling up made the problem worse, not better: bigger models were better at rendering convincing frames but worse at tracking off-screen dynamics.

What's missing: persistent state. The fundamental gap is architectural. Today's video models maintain no persistent representation of world state independent of the current camera view. When the camera turns away from an object, the model loses track of it. When the camera returns, the model re-renders a plausible starting state from its training distribution, not the actual state the object should be in. Researchers describe the missing component as a "state writer" — a mechanism that continuously updates a representation of everything happening in the scene, not just what the camera currently shows.
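
The difference between view-bound and persistent state can be illustrated with a toy 1-D scene. This is a sketch under loud assumptions, not a description of any real video model or of WRBench itself: the Obj class, the simulate function, and the rule that a view-bound model simply freezes unseen objects are all hypothetical simplifications of the failure described above.

```python
from dataclasses import dataclass


@dataclass
class Obj:
    x: float   # position along a 1-D scene
    vx: float  # distance moved per time step


def simulate(objects: dict[str, Obj], camera_views: list[tuple[float, float]],
             persistent: bool) -> dict[str, float]:
    """Step through one camera view per time step and return what the
    final view shows. With persistent=True every object is advanced each
    step (a stand-in for a 'state writer'); with persistent=False only
    objects currently in view are advanced, so unseen objects freeze."""
    objects = {name: Obj(o.x, o.vx) for name, o in objects.items()}  # work on a copy
    visible: dict[str, float] = {}
    for lo, hi in camera_views:
        # Render first: which objects fall inside the current view?
        visible = {name: o.x for name, o in objects.items() if lo <= o.x <= hi}
        # Then advance the world by one step.
        for name, o in objects.items():
            if persistent or name in visible:
                o.x += o.vx
    return visible


scene = {"cat": Obj(x=0.0, vx=1.0)}  # cat heading toward a bed at x = 3
views = [(0.0, 5.0), (10.0, 12.0), (10.0, 12.0), (0.0, 5.0)]  # look, look away twice, look back

with_state_writer = simulate(scene, views, persistent=True)
view_bound = simulate(scene, views, persistent=False)
```

With persistent state the cat has reached the bed when the camera returns; in the view-bound version it is still where the camera left it. The whole difference is one condition in the update loop: whether the world advances for objects the camera cannot see.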

Why this matters beyond video. World models are central to plans for robots, autonomous vehicles, and any AI that needs to operate in the physical world over time. A robot that can't track where an object went when it looked away can't reliably plan multi-step tasks. An autonomous vehicle that resets its model of nearby cars when they briefly leave the field of view is dangerous. The gap that WRBench measures in video generation is the same gap that limits physical AI more broadly.

Current world models work well in domains where the state space is compact and learnable — game environments, simple physics tasks, structured 3D scenes. Extending them to the full complexity of open physical environments, including off-screen dynamics, persistent object state, and long-horizon consequences of actions, is one of the central open problems in AI research today.

For related coverage, see our news coverage of WRBench and the limits of today's video world models.

Key papers
Dream to Control: Learning Behaviors by Latent Imagination (2019)
Mastering Diverse Domains through World Models / DreamerV3 (2023)
WRBench: World Model Reliability Benchmark (2026)