Fei-Fei Li draws a map the big labs don't want you to see

Every major AI lab building spatial AI right now is betting on one axis. Fei-Fei Li, co-founder of World Labs, published a piece on June 3 that makes that single-axis problem visible and names it.

The argument starts with a terminological complaint that doubles as a structural critique. “World model” currently covers a video generator that makes physically impossible fire look beautiful, a physics engine that respects Newton’s laws, and a robotic planning system deciding where to move a gripper. All three go by the same label. Li argues this conflation is not just imprecise; it is operationally misleading, because the three represent distinct functional outputs with different data requirements, different failure modes, and different commercial ceilings.

Her taxonomy splits the space into renderers, simulators, and planners. A renderer outputs pixels: it is optimized for visual plausibility and trained on internet video. Google’s Veo, OpenAI’s Sora, and Meta’s Movie Gen sit here. They produce outputs that look correct but cannot be trusted to tell you whether a structure will hold. A simulator outputs state: geometrically and physically accurate representations that both humans and software agents can compute on. Nvidia’s Omniverse and, increasingly, Cosmos target this category. A planner outputs actions: given an observation and a goal, it tells an agent what to do next. Boston Dynamics-style robotic control and the newer wave of vision-language-action (VLA) models live here.
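One way to make the three output contracts concrete is to write them down as interfaces. The sketch below is illustrative only: the names `Renderer`, `Simulator`, `Planner`, `contracts_fulfilled`, and the toy model classes are hypothetical and do not come from Li's piece or from any vendor's API. It assumes a platform can be summarized by which of three methods it exposes, which is a deliberate simplification.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Renderer(Protocol):
    """Outputs pixels: prompt in, frames out."""
    def render(self, prompt: str) -> Any: ...


@runtime_checkable
class Simulator(Protocol):
    """Outputs state: advances a physical state by a time step."""
    def step(self, state: Any, dt: float) -> Any: ...


@runtime_checkable
class Planner(Protocol):
    """Outputs actions: observation and goal in, next action out."""
    def plan(self, observation: Any, goal: Any) -> Any: ...


# Hypothetical registry mapping the taxonomy's labels to the interfaces above.
CONTRACTS = {"renderer": Renderer, "simulator": Simulator, "planner": Planner}


def contracts_fulfilled(model: object) -> set[str]:
    """Return the contract names whose method the model exposes.

    runtime_checkable protocols only check that the method exists,
    not that its output is physically accurate or useful.
    """
    return {name for name, proto in CONTRACTS.items() if isinstance(model, proto)}


class ToyVideoModel:
    """Stand-in for a pure renderer (assumption, not a real product)."""
    def render(self, prompt: str) -> list[list[int]]:
        return [[0, 0], [0, 0]]  # placeholder 2x2 "frame"


class ToyDualOutputModel:
    """Stand-in for a model that both renders and exposes physical state."""
    def render(self, prompt: str) -> list[list[int]]:
        return [[0, 0], [0, 0]]

    def step(self, state: dict, dt: float) -> dict:
        # Trivial constant-velocity update, for illustration only.
        return {"x": state["x"] + state["v"] * dt, "v": state["v"]}
```

A check like `contracts_fulfilled(ToyVideoModel())` only tells you which methods exist. Whether a simulator's state is accurate enough to build on is the harder question, and an interface cannot answer it.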

The implicit scoring is pointed. Sora scores high on rendering, low on simulation, unproven on planning. Google’s Genie 3 is a renderer that can condition on user input, which is one step toward planning, but it produces no physical state a robot could use. Nvidia’s Cosmos 3, covered in this newsletter yesterday, is notable precisely because it adds action output to a model with strong physics dynamics. On Li’s taxonomy, Cosmos 3 is among the first industry releases to score meaningfully on two axes simultaneously. Meta’s V-JEPA, designed as a latent world model for video prediction, sits awkwardly between renderer and simulator without fully committing to either output contract.

Li’s position is not neutral. She runs World Labs, and the piece functions as a positioning document as much as a research contribution. World Labs’ Marble product generates explorable 3D environments from multimodal prompts, outputting both Gaussian splats for visual rendering and collision meshes that a physics engine can operate on. That dual output is the practical embodiment of her argument: that the renderer-simulator boundary should dissolve inside a single architecture.

The piece is worth reading with that conflict of interest visible. The taxonomy itself, however, does not require World Labs to be right about its own product to be analytically useful. The three-category split maps onto real divergences in training data, evaluation criteria, and downstream use cases. A renderer trained on internet video has no path to sim-to-real accuracy without a fundamental data shift. A simulator optimized for physical fidelity is not naturally positioned to serve the consumer video market. Li’s central claim is that the knowledge required to do all three is largely shared, and that the field will converge toward unified architectures. That is a testable prediction, and the current trajectory of Cosmos-class models suggests it is not wrong.

The sim-to-real gap remains the structural obstacle Li does not fully resolve. AI-generated geometry can appear valid while containing self-intersections that produce nonsensical physics on contact. Annotated 3D data at the scale that renderer training takes for granted does not exist. Li acknowledges both problems without offering a timeline for closing them.

For teams building on any of the current world-model platforms, this taxonomy provides a sharper evaluation frame than benchmark scores alone. Ask which of the three output contracts your platform actually fulfills before committing to a production integration.

Fei-Fei Li on Substack (drfeifei.substack.com), published June 3, 2026.
