Ethan He, who led Nvidia’s Cosmos World Model and then built Grok Image at xAI in three months, sat for a long-form interview with Latent Space published June 1. What makes it worth reading is not the career narrative. It is the specific claim He makes about where frontier video work is headed and why most current systems are not yet there.
His central argument is that video generation alone is a dead end as an agentic goal. The output that matters is a model that can generate a plausible next frame, treat that frame as evidence about the state of the world, and select an action accordingly. That loop, reasoning forward in time through generated video to inform real decisions, is what He calls a true video agent. No current production system reliably closes it.
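To make the loop concrete, here is a minimal sketch in Python. It is an illustration only, not anything described in the interview: the "frame" is reduced to a single number, and `toy_world_model`, `choose_action`, and `run_agent` are hypothetical names invented for this example. A real system would replace the toy predictor with a learned video model and the scoring function with a task objective.

```python
from typing import Callable, List, Sequence

# Assumption: a "frame" is collapsed to a 1-D position so the loop stays readable.
Frame = float
Action = float


def toy_world_model(frame: Frame, action: Action) -> Frame:
    """Hypothetical stand-in for a learned world model: predict the next frame."""
    return frame + action


def choose_action(
    frame: Frame,
    actions: Sequence[Action],
    predict: Callable[[Frame, Action], Frame],
    score: Callable[[Frame], float],
) -> Action:
    """Imagine the next frame for each candidate action and keep the best-scoring one."""
    return max(actions, key=lambda a: score(predict(frame, a)))


def run_agent(start: Frame, goal: Frame, actions: Sequence[Action], steps: int) -> List[Frame]:
    """Generate, evaluate, act, repeat. Returns the sequence of visited frames."""
    frame = start
    trajectory = [frame]
    for _ in range(steps):
        # Score an imagined frame by how close it lands to the goal.
        action = choose_action(frame, actions, toy_world_model, lambda f: -abs(goal - f))
        # Assumption: the toy model also plays the role of the environment here.
        # In a real system the world, not the model, would return the next observation.
        frame = toy_world_model(frame, action)
        trajectory.append(frame)
    return trajectory


if __name__ == "__main__":
    print(run_agent(start=0.0, goal=3.0, actions=[-1.0, 0.0, 1.0], steps=5))
```

The point of the sketch is the shape of the loop: the generated frame is never the product. It is an intermediate prediction that gets scored and discarded once an action is chosen.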
The distinction matters because the field has spent three years celebrating video generation quality. Sora 1 was positioned as a step toward physical AI. Runway Gen-4, Kling, Cosmos, and Grok Image followed. Each improved resolution, temporal consistency, and prompt fidelity. None of them changed the fundamental architecture: these are generative systems, not planning systems. His framing treats that entire cohort as pre-agentic. Good tools. Not yet agents.
The path He describes runs through world models: models pretrained on video corpora to learn the spatial and temporal regularities of physical environments. The bet is that a model with a strong prior about how objects move, how forces propagate, and how time unfolds can generate frames that are useful for planning rather than just visually plausible. He built Cosmos 1 and 2 at Nvidia on exactly this premise. Cosmos 3 shipped the same day this interview was published, after his departure to xAI.
He is candid about the infrastructure ceiling this creates. Video training is expensive in GPU-hours per frame at a scale that text pretraining is not. A single high-quality video token encodes spatial, temporal, and semantic information, and learning it requires substantially more compute than learning a text token. He does not give specific numbers, but the implication is clear: only labs with the capital and cluster access of Nvidia, Google, xAI, and Meta can pretrain world models at frontier scale. Everyone else is fine-tuning on top of what those labs release.
The data bottleneck is, if anything, harder than the compute bottleneck. Video training data is more expensive to obtain, harder to label with action annotations, and more difficult to curate for quality than either text or image data. Autonomous driving datasets are the closest analog to the labeled, temporally structured video that world-model training wants, but they are narrow in domain. General-purpose video corpora from the open web are available but noisy. He describes this as an unsolved infrastructure problem.
The skepticism worth holding is structural. Video-agent narratives have been positioned as “the next frontier” since Sora’s launch in early 2024. The agent loop He describes, generating a future frame and using it to inform action selection, has not been demonstrated in production at any lab. He is clear about this gap between research demos and deployable physical AI. His credibility earns the claim some weight. It does not verify the timeline.
For builders working on robotics, autonomous systems, or any product that needs a model with physical-world priors, the Latent Space interview is the most concrete public roadmap from someone who has shipped at this level. The practical read is that frontier world-model capability will remain concentrated at a few large-capex labs for at least the next twelve months. If your product depends on a video agent that reasons through generated frames, you are either building at one of those labs or waiting for a capable open-weight release.
Based on the Latent Space interview with Ethan He, published June 1, 2026, at latent.space.