NVIDIA released Cosmos 3 on June 1 as an open frontier foundation model for physical AI, making the weights available for fine-tuning and self-hosting by developers building robotics systems and autonomous vehicles. The release marks the first time NVIDIA has shipped a model whose output modalities include robot action commands, not just text, image, and video.
The architecture is a mixture-of-transformers design that pairs a reasoning transformer with an expert generation transformer. This pairing lets the model combine vision reasoning with multimodal generation across text, image, video, ambient sound, and robot actions in a single forward pass. NVIDIA describes the model as an “omnimodel,” meaning developers do not need to stitch together separate perception and planning stacks; both run through Cosmos 3.
The action modality is the substantive claim. Most frontier foundation models top out at text and images. A model that natively emits structured robot commands changes the build pattern for embodied AI startups: instead of pairing a language model with a separate control policy, a team can fine-tune Cosmos 3 end-to-end on their target embodiment. NVIDIA says this pretraining foundation reduces the amount of in-domain robot data required compared to training a policy from scratch.
NVIDIA cited leaderboard results for Cosmos 3 on a physical-AI benchmark, but that leaderboard is one NVIDIA uses for its own model comparisons. Physical-AI benchmarks lack the standardized, community-maintained rigor of language benchmarks like GSM8K or MMLU, so a top ranking at this stage is closer to a marketing signal than a verified capability claim. The real test is whether robotics teams report meaningful data-efficiency gains on actual hardware platforms in the months ahead.
The competitive frame matters here. Google’s RT-2 and RT-X research, the multi-lab Open X-Embodiment dataset collaboration, and Apple’s on-device robotics research all occupy the physical-AI foundation space. What NVIDIA brings that those efforts do not is silicon distribution: Cosmos 3 is designed to run on NVIDIA’s own inference hardware, and NVIDIA’s relationships with automotive and industrial robotics customers give the model a direct path to deployment at scale. The open-weights decision is strategic, not altruistic. It seeds developer adoption before proprietary competitors can lock in the ecosystem.
Fine-tuning workflows will determine whether Cosmos 3 holds its positioning. A pretrained physical-AI foundation model is only valuable if the fine-tuning cost and complexity fall within reach of a small robotics team. NVIDIA has not disclosed specific numbers on how much labeled robot data Cosmos 3 requires to reach usable policy quality on a new embodiment. That figure, when it emerges from early adopters, will settle whether the data-efficiency claim holds up on real hardware or is a benchmark artifact.
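One way a team could put a number on the data-efficiency question is to measure task success at several demonstration counts, once for a policy trained from scratch and once for a fine-tuned foundation model, then compare how many demonstrations each needs to hit the same success target. The sketch below is a minimal illustration of that arithmetic, not anything NVIDIA has published: the function names, the success-rate target, and the shape of the input (a mapping from demonstration count to measured success rate) are all assumptions made for the example.

```python
from typing import Mapping, Optional


def demos_to_reach(curve: Mapping[int, float], target: float) -> Optional[int]:
    """Smallest demonstration count whose measured success rate meets the target.

    `curve` maps a demonstration count to a success rate in [0, 1].
    This input format is a hypothetical convention for this sketch.
    """
    if any(n <= 0 for n in curve):
        raise ValueError("demonstration counts must be positive")
    for n_demos in sorted(curve):
        if curve[n_demos] >= target:
            return n_demos
    return None  # the target was never reached at any measured count


def data_efficiency_ratio(
    scratch: Mapping[int, float],
    finetuned: Mapping[int, float],
    target: float,
) -> Optional[float]:
    """How many times fewer demonstrations the fine-tuned policy needed.

    Returns None when either policy never reaches the target, because
    no ratio can be claimed from the measurements in that case.
    """
    n_scratch = demos_to_reach(scratch, target)
    n_finetuned = demos_to_reach(finetuned, target)
    if n_scratch is None or n_finetuned is None:
        return None
    return n_scratch / n_finetuned
```

A ratio well above 1 measured on the team's own hardware would support the pretraining claim; a ratio near 1, or a result of None, would suggest the benefit does not transfer to that embodiment. The comparison is only as fine-grained as the demonstration counts that were actually measured.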
For robotics startups currently sourcing or collecting in-domain training data, Cosmos 3 is worth benchmarking immediately. If the pretraining benefit is real, it reduces one of the largest capital costs in embodied AI development, and teams that build their fine-tuning pipelines on it now will have a workflow head start over those that wait until the model’s limitations are widely documented.
Sourced from the NVIDIA Newsroom announcement published June 1, 2026.