Pro TV iLikeIT: What World Models Are — and Why They're the Next Step After ChatGPT

On May 26th I returned to Pro TV's iLikeIT, this time to talk about world models — a class of AI that doesn't predict the next word, but the next state of a physical environment. It's a quieter shift than ChatGPT was, and a deeper one.

A large language model (GPT, Claude, Gemini) is trained to continue text. Its "understanding" of gravity, friction, or object permanence is inferred indirectly, from the way words appear next to other words in the training data. It isn't grounded in physical experience. A world model is trained differently: on video and sensor streams. It learns the dynamics of a scene and predicts what happens next when something acts on the environment. Push a glass off the edge of a desk, and the model anticipates the fall and the shatter, because it has seen enough of the world to internalize the rule. That is much closer to how a toddler learns than to how a chatbot does.
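To make the contrast concrete, here is a toy sketch in Python. It is only an illustration of the two interfaces, not how any real system is built: the "language model" is a bigram counter, and the "world model" is a hand-written physics step standing in for dynamics that a real model would learn from video. All names and constants (GlassState, TABLE_EDGE_X, the push value) are made up for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


def train_bigram(text):
    """Count which word follows which: about the simplest 'language model' there is."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next_word(counts, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = counts.get(word)
    return followers.most_common(1)[0][0] if followers else None


@dataclass(frozen=True)
class GlassState:
    x: float   # horizontal position in metres
    y: float   # height above the floor in metres
    vx: float  # horizontal speed in m/s
    vy: float  # vertical speed in m/s
    broken: bool = False


TABLE_EDGE_X = 1.0   # hypothetical table edge
TABLE_HEIGHT = 0.75  # hypothetical table height
G = 9.81             # gravitational acceleration in m/s^2


def predict_next_state(state, push_accel, dt=0.05):
    """Next-state prediction: hand-written rules standing in for learned dynamics."""
    if state.broken:
        return state
    vx = state.vx + push_accel * dt
    x = state.x + vx * dt
    supported = x <= TABLE_EDGE_X and state.y == TABLE_HEIGHT
    if supported:
        # Still on the table: it slides, but does not fall.
        return GlassState(x, TABLE_HEIGHT, vx, 0.0)
    # Past the edge: gravity takes over.
    vy = state.vy - G * dt
    y = state.y + vy * dt
    if y <= 0.0:
        return GlassState(x, 0.0, 0.0, 0.0, broken=True)
    return GlassState(x, y, vx, vy)


counts = train_bigram("the glass falls off the table and the glass breaks")
print(predict_next_word(counts, "the"))  # a word, chosen from word statistics

state = GlassState(x=0.9, y=TABLE_HEIGHT, vx=0.0, vy=0.0)
for _ in range(40):
    state = predict_next_state(state, push_accel=2.0)
print(state)  # a physical state, produced by rolling the dynamics forward
```

The first function only ever sees words. The second takes a state and an action and returns a new state, which is the interface a robot or a simulator can actually use.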

"A language model predicts the next word in a sentence. A world model predicts the next state of a micro-universe — it knows that when you push a glass off the edge of a table, gravity steps in."

Why it matters now

Three research lines have reached the point of public, working demos by 2026:

  • Genie 3 (Google DeepMind, announced in August 2025 and opened to the public in January 2026) is the first general-purpose world model you can navigate in real time. Give it a text prompt (a city, a forest, a room) and it generates an interactive world at 24 frames per second in 720p, holding consistency for a few minutes. You move through it, and the model "remembers" what it generated behind you.
  • V-JEPA 2 (Meta) takes the other route: instead of generating pixels, it learns abstract representations from video. After pretraining on more than a million hours of video, it needed only 62 hours of unlabeled robot data to pull off zero-shot robot planning: a robot deciding how to act in an environment it had never seen. For robotics, that kind of data efficiency matters more than visual fidelity.
  • Omni models — which see, hear, and respond on the same channel — have removed the need to convert everything to text before processing. It's the step from "AI that reads a description of a scene" to "AI that sees the scene".
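How does a robot "plan" with a world model? One common idea is to imagine many candidate action sequences inside the model and keep the one whose predicted outcome lands closest to the goal. The sketch below shows that idea with random sampling in plain 2D coordinates. It is a deliberate simplification and an assumption on my part, not V-JEPA 2's actual code: the real system predicts in a learned representation space, and the functions world_model and plan here are hypothetical stand-ins.

```python
import random


def world_model(state, action):
    """Hypothetical stand-in for a learned predictor: moves an (x, y) point by the action."""
    return (state[0] + action[0], state[1] + action[1])


def distance(a, b):
    """Euclidean distance between two (x, y) points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def plan(state, goal, horizon=5, candidates=200, rng=None):
    """Sample random action sequences, imagine each one, keep the best."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    best_seq, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(horizon)]
        imagined = state
        for action in seq:
            imagined = world_model(imagined, action)  # no real robot moves here
        cost = distance(imagined, goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


actions, predicted_miss = plan(state=(0.0, 0.0), goal=(2.0, 1.0))
print(len(actions), "actions, predicted miss:", round(predicted_miss, 3))
```

Nothing in the loop touches the real world. All the trial and error happens in imagination, which is why a good world model can stand in for a lot of expensive robot data.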

The practical difference: an LLM can explain to you what would happen if you opened a tap. A world model can be wired to a robot that actually opens the tap, watches whether the water lands where it should, and adjusts. The first works with linguistic representations; the second, with causal representations of the world. For robotics, autonomous vehicles, and any agent that touches a real interface, the second is indispensable.
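The tap example is a feedback loop: act, observe, compare with the target, adjust. A minimal sketch, with made-up numbers: predicted_landing is a hypothetical internal model, real_landing is a hypothetical "real world" that deliberately disagrees with it, and the gain of 0.05 is chosen only so the toy converges.

```python
def predicted_landing(opening):
    """Hypothetical internal model: where the water lands (cm) for a given tap opening."""
    return 10.0 * opening


def real_landing(opening):
    """Hypothetical real world, which differs a little from the internal model."""
    return 12.0 * opening + 1.0


def closed_loop(target_cm, steps=20, gain=0.05):
    """Start from the model's guess, then correct it using what is actually observed."""
    opening = target_cm / 10.0  # invert the internal model for a first guess
    for _ in range(steps):
        observed = real_landing(opening)  # act, and watch where the water lands
        error = target_cm - observed      # compare with the goal
        opening += gain * error           # adjust the tap
    return opening, real_landing(opening)


first_guess = 6.0 / 10.0
print("open loop lands at:", real_landing(first_guess))
print("closed loop lands at:", round(closed_loop(6.0)[1], 4))
```

The model's first guess is wrong because the model is imperfect. What makes the system useful is that it watches the result and corrects itself, which a purely text-based description of a tap cannot do.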

Gemini Spark and the rise of background agents

We also covered Gemini Spark, which Google announced last week at I/O 2026. It's an agent that no longer lives inside a chat window: it runs on a dedicated virtual machine in Google Cloud, connected to your Gmail, Calendar, and Docs. You give it a goal ("track flight prices to Lisbon and book when the price drops below 200 euros"), and it works in the background for hours or days. That's a new category: the AI isn't answering a question, it's executing an objective. Spark is powered by Gemini 3.5 and included in the Google AI Ultra plan; it is currently rolling out in the US and is due in Europe in Q3.
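Stripped of everything interesting, "executing an objective" looks like a loop that keeps checking a condition and acts once it is met. The sketch below is purely illustrative and says nothing about how Spark itself works: Goal, run_agent, the price feed, and the book callback are all hypothetical names, and the prices are invented.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    description: str
    threshold_eur: float


def run_agent(goal, price_feed, book, max_checks=1000):
    """Check prices one by one; book the first one below the threshold."""
    for check, price in enumerate(price_feed, start=1):
        if check > max_checks:
            break  # give up rather than run forever
        if price < goal.threshold_eur:
            book(price)  # the action that finally fulfils the goal
            return price
    return None  # the goal was never met


bookings = []
goal = Goal("Flight to Lisbon", threshold_eur=200.0)
result = run_agent(goal, price_feed=[250.0, 230.0, 199.0, 180.0], book=bookings.append)
print("booked at:", result)
```

A real background agent would wait between checks, survive restarts, and ask for confirmation before spending money. The point is only the shape: a goal and a stopping condition instead of a question and an answer.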

Costs and rate limits

One million tokens through a frontier model today costs ten to a hundred times less than at ChatGPT's launch in late 2022. That changes who can build products on top of AI — what used to be a research budget is now a side-project budget. The rate limits users still hit on Claude or ChatGPT aren't a business decision; they're a physical constraint — there aren't enough chips, yet. Capacity moves in steps: Anthropic tightened the Claude limits earlier this spring, then doubled them again on May 6 once more inference came online. The trend line is clear — pricing and limits will keep loosening as compute scales, and access to a frontier model will eventually be a commodity.
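To see what a 10x or 100x price drop does to a budget, here is the arithmetic as a small sketch. The baseline price and the monthly token volume are hypothetical placeholders picked for round numbers, not quoted prices from any provider.

```python
def cost(tokens, price_per_million):
    """Total cost of processing `tokens` at a given price per million tokens."""
    return tokens / 1_000_000 * price_per_million


BASELINE_PRICE = 20.0        # hypothetical price per million tokens, in any currency
MONTHLY_TOKENS = 50_000_000  # hypothetical usage of a small product

for factor in (1, 10, 100):
    price = BASELINE_PRICE / factor
    print(f"{factor:>3}x cheaper: {cost(MONTHLY_TOKENS, price):,.2f} per month")
```

The same workload goes from a four-figure monthly bill to a two-figure one, which is the difference between a research budget and a side-project budget.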

ChatGPT taught us to talk to AI. World models are teaching it to predict the world. Agents like Spark are teaching it to act in it. The three lines will meet, and the meeting point is what many call AGI. We're not there yet — but the steps are starting to land, one after another.