News · 2026-06-27

This AI predicts how objects move by tracking shapes, not pixels

Most AI that predicts the physical world does it by predicting pictures. You feed in video frames and the model guesses the next frames, pixel by pixel. It looks impressive, but it has a quiet weakness: a model working in pixels doesn't really know that a coffee mug is a solid object. It knows what a mug tends to look like from one camera angle. Move the camera, and its grasp of the mug's actual shape can fall apart, because it never represented the shape in the first place - only the photograph of it. A new model called PhysiFormer takes a different route. Instead of predicting how a scene will look, it predicts how objects will move in real three-dimensional space.

The shift sounds technical but it's intuitive. PhysiFormer represents an object the way a graphics or engineering program would: as a mesh, a connected web of points in 3D coordinates that defines its surface. Give it the starting positions and velocities of those points, plus what the object is made of - rigid like a wooden block, or elastic like a rubber ball - and it predicts where every point travels next. It is forecasting the motion of the thing itself, in world coordinates, not the appearance of the thing from a particular viewpoint. Change where you stand and the prediction doesn't break, because the model was never relying on the view to begin with.
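To make the idea concrete, here is a minimal sketch of what "an object as a mesh in world coordinates" could look like in code. Everything in it is an assumption for illustration: the `MeshState` layout and the `toy_step` function are hypothetical, and `toy_step` is plain free-fall physics standing in for the trained network, not PhysiFormer's method.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class MeshState:
    """One object as a mesh in world coordinates (hypothetical layout)."""
    vertices: List[Vec3]               # 3D position of every mesh point
    velocities: List[Vec3]             # 3D velocity of every mesh point
    faces: List[Tuple[int, int, int]]  # triangles, as indices into vertices
    material: str                      # e.g. "rigid" or "elastic"


def toy_step(state: MeshState, dt: float,
             gravity: Vec3 = (0.0, 0.0, -9.81)) -> MeshState:
    """Stand-in for the learned predictor: free fall, semi-implicit Euler.

    A model like PhysiFormer would replace this with a trained network.
    """
    # Update velocities first, then move each point with its new velocity.
    new_vel = [tuple(v[i] + gravity[i] * dt for i in range(3))
               for v in state.velocities]
    new_pos = [tuple(p[i] + nv[i] * dt for i in range(3))
               for p, nv in zip(state.vertices, new_vel)]
    return MeshState(new_pos, new_vel, state.faces, state.material)


# A tetrahedron drifting along +x while it falls.
block = MeshState(
    vertices=[(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, 2.0)],
    velocities=[(1.0, 0.0, 0.0)] * 4,
    faces=[(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)],
    material="rigid",
)
next_block = toy_step(block, dt=0.1)
```

Notice that no camera appears anywhere in the input or the output. The state before and after the step is a list of 3D points, which is why a change of viewpoint has nothing to break.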

Here's an analogy. A pixel-based predictor is like a sketch artist drawing what the next photo of a bouncing ball will look like. PhysiFormer is like a physics student tracking the ball's actual position and speed and saying where it'll be a moment later. The artist can be fooled by lighting, angle, and shadow. The student is reasoning about the ball itself, so the answer holds from any seat in the stadium.

The genuinely surprising part is what PhysiFormer doesn't need. Researchers who build physics-aware AI usually bake in rules by hand - force the model to keep rigid objects rigid, force it to respect cause and effect. PhysiFormer skips most of that. Built on a diffusion transformer - the same family of model behind modern image and video generators, which works by repeatedly turning noise into structure - it learns physical behavior from data alone, and still comes out respecting things like rigidity and conserved momentum better than the autoregressive baselines it was compared against. It also handles many objects at once gracefully, treating them in a way that doesn't care which object you list first, which is how the real world works: a pile of blocks has no official ordering.
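The "no official ordering" property can be shown with a toy example. The article does not say how PhysiFormer achieves it, so the sketch below swaps in the simplest trick that has the same property: each object is updated using only its own state plus a summary (the mean position) that is identical however the objects are listed. The function `toy_interact` and its `pull` parameter are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def toy_interact(centers: List[Vec3], pull: float = 0.1) -> List[Vec3]:
    """Nudge each object toward the group's mean position.

    The mean is a symmetric summary: it is the same whatever order the
    objects are listed in. math.fsum avoids order-dependent rounding.
    """
    n = len(centers)
    mean = tuple(math.fsum(c[i] for c in centers) / n for i in range(3))
    return [tuple(c[i] + pull * (mean[i] - c[i]) for i in range(3))
            for c in centers]


blocks = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 4.0, 1.0)]
order = [2, 0, 1]                      # list the same blocks differently
shuffled = [blocks[j] for j in order]

out = toy_interact(blocks)
out_shuffled = toy_interact(shuffled)

# Each block gets the same prediction wherever it sits in the list.
same = all(out_shuffled[k] == out[j] for k, j in enumerate(order))
```

Shuffling the input only shuffles the output in the same way. A model that instead read objects in sequence, treating the first one specially, would not pass this check.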

There's another nice property. PhysiFormer is probabilistic - it doesn't commit to one single future but can sample several plausible ones. That matches reality, where a teetering stack of objects could topple in more than one believable way. A model that admits this uncertainty is more honest, and more useful for planning, than one that fakes a single confident answer.
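Sampling several futures instead of one is easy to picture in code. The sketch below is a deliberately tiny stand-in, not PhysiFormer's sampler: the hypothetical `sample_futures` function adds random noise to a stack's tilt and reports which way it would fall, where a real model would draw whole trajectories.

```python
import random
from typing import List


def sample_futures(tilt_deg: float, n: int, seed: int = 0,
                   noise_deg: float = 1.0) -> List[str]:
    """Draw n possible outcomes for a teetering stack (toy stand-in).

    Each draw perturbs the measured tilt with Gaussian noise and reports
    which way the stack would fall.
    """
    rng = random.Random(seed)          # seeded so runs are repeatable
    outcomes = []
    for _ in range(n):
        tilt = tilt_deg + rng.gauss(0.0, noise_deg)
        outcomes.append("left" if tilt < 0 else "right")
    return outcomes


# A stack leaning very slightly to the right: both outcomes stay possible.
futures = sample_futures(tilt_deg=0.2, n=1000)
share_right = futures.count("right") / len(futures)
```

A planner can use a number like `share_right` directly: if a meaningful fraction of sampled futures end badly, the robot should steady the stack before reaching past it.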

Why does this matter? Because predicting physical interactions is foundational for robots that manipulate objects, for graphics and animation that need to be physically right rather than merely look plausible, and for any design tool that has to simulate how materials behave. Doing it in coordinate space rather than pixel space means the predictions are geometry-aware and viewpoint-independent - exactly the qualities a robot needs when its camera is in a different spot than the camera in the training data. PhysiFormer arrived as part of a larger surge of world-model research this week, and it represents one of the cleaner ideas in that wave: stop predicting the photograph, start predicting the thing.

The honest caveat: representing the world as 3D meshes assumes you can get those meshes in the first place, which is straightforward in simulation and much harder from a raw camera feed in a messy real kitchen. The reported results come from the authors' own evaluations against autoregressive baselines, and a method that shines on controlled object-motion tasks still has to prove itself on the clutter and noise of the real world. But the core bet - that geometry beats appearance for understanding physics - is a compelling one, and it's a direction worth watching as world models mature.

Primary source, verified: arXiv 2606.27364