This is not a world model. It is at best a reimplementation of the NVIDIA prior art around NeRF / 3D Gaussian Splatting and monocular depth, wrapped in a nice product and workflow. What they’re actually shipping is an offline asset generator: you feed it text, images, or video, it runs depth/structure estimation and neural 3D reconstruction, and you get a static splat/mesh world you can then render or simulate in a real engine. That’s useful and impressive engineering, but it’s very different from a proper “world model” in the RL/embodied‑AI sense. Here there are no online dynamics, no agent loop, and no interactive rollouts; it’s closer to a high‑end NeRF/GS pipeline plus tooling than to something like Google’s Genie, Genie 2, and Genie 3, which actually couple generative rendering with action‑conditioned temporal evolution. Calling this a “world model” feels more like marketing language than a meaningful technical distinction.
In fact, my definition of a world model is closer to what Demis has hinted at in his discussions: the fact that video generation models like Veo can intuit physics from video training data alone suggests that there is an underlying manifold in reality that is essentially computable, and that these models are simulating it. Building such a model would essentially mean building a physics engine of some kind that predicts this manifold.
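
To make the distinction concrete, here is a toy Python sketch of the two interfaces. Everything in it is made up for illustration: `StaticScene`, `State`, `step`, and `rollout` are hypothetical names, and the "dynamics" is a hand-written unit-mass update rather than anything learned. In a real world model `step` would be a learned network predicting the next latent state or frame, so treat this as a sketch of the shape of the API, not of any shipping system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StaticScene:
    """Stand-in for a baked splat/mesh asset (hypothetical): fixed geometry."""
    points: tuple  # toy 1D point positions

    def render(self, camera_x: float) -> tuple:
        # The view depends only on the camera pose; nothing in the scene evolves.
        return tuple(p - camera_x for p in self.points)


@dataclass(frozen=True)
class State:
    pos: float
    vel: float


def step(state: State, action: float, dt: float = 0.1) -> State:
    """Toy action-conditioned transition: the action is a force on a unit mass.

    Assumption: a learned world model would replace this hand-written rule.
    """
    vel = state.vel + action * dt
    pos = state.pos + vel * dt
    return State(pos, vel)


def rollout(state: State, actions) -> list:
    """Agent loop: feed actions in one at a time and collect the trajectory."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    scene = StaticScene(points=(0.0, 2.0))
    print(scene.render(camera_x=1.0))  # same answer no matter when you ask

    start = State(pos=0.0, vel=0.0)
    print(rollout(start, [1.0, 1.0])[-1])  # pushed: the state moves
    print(rollout(start, [0.0, 0.0])[-1])  # not pushed: the state stays put
```

The static scene only answers "what does this look like from here?", while the second interface answers "what happens next if I do this?". That second question is the one I mean by a world model.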