The Allen Institute for AI released MolmoMotion on June 17, a model that takes an RGB video frame, a set of marked 3D points on an object, and a plain-English action instruction, then forecasts where those points will travel over the coming seconds. The goal is a capability that robotics and embodied AI have needed for years: closing the loop between what a human says and what an object will physically do.
The training foundation is MolmoMotion-1M, which Ai2 describes as the largest corpus of action-described, object-grounded 3D point trajectories assembled to date. The numbers are specific: 1.16 million videos, 736 distinct motion types, and 5,600 different objects. An automatic pipeline extracts trajectories from unconstrained video footage, filters noisy tracks, and smooths the resulting paths. The scale matters because motion diversity is the core problem. Existing datasets tend to concentrate on a narrow slice of object types and actions; a model trained on them overfits to the scenarios it has seen and breaks on anything else.
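The filter-and-smooth stage of such a pipeline can be sketched in a few lines. This is only an illustration, not Ai2's implementation: the jump threshold, the moving-average smoother, and the names `filter_tracks` and `smooth` are assumptions chosen for clarity.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float, float]
Track = List[Point]


def max_step(track: Track) -> float:
    """Largest Euclidean distance between consecutive points of a track."""
    return max((dist(p, q) for p, q in zip(track, track[1:])), default=0.0)


def filter_tracks(tracks: List[Track], max_jump: float) -> List[Track]:
    """Keep only tracks whose frame-to-frame jumps stay under max_jump.

    Assumption: a sudden large jump signals a tracking failure.
    """
    return [t for t in tracks if max_step(t) <= max_jump]


def smooth(track: Track, window: int = 3) -> Track:
    """Centered moving average; the window shrinks at both ends of the track."""
    half = window // 2
    out: Track = []
    for i in range(len(track)):
        segment = track[max(0, i - half): i + half + 1]
        n = len(segment)
        out.append((
            sum(p[0] for p in segment) / n,
            sum(p[1] for p in segment) / n,
            sum(p[2] for p in segment) / n,
        ))
    return out


if __name__ == "__main__":
    clean = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    noisy = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
    kept = filter_tracks([clean, noisy], max_jump=1.5)
    print(len(kept), smooth(kept[0]))
```

A production pipeline would likely use more robust criteria (tracker confidence, occlusion handling), but the shape of the step is the same: reject implausible tracks first, then smooth what survives.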
Ai2 published two model variants. MolmoMotion-AR (autoregressive) predicts future coordinates as structured text, one step at a time. MolmoMotion-FM (flow-matching) transforms a noise distribution into continuous 3D trajectories, which lets the model represent uncertainty when several futures are equally plausible. Both variants build on Molmo 2 as their backbone, so they inherit the vision-language foundation of Ai2’s earlier multimodal work rather than starting from scratch.
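To make the flow-matching idea concrete, here is a toy sampler. It is a sketch under strong assumptions: the learned velocity network is replaced by a hypothetical closed-form field (`toy_velocity`) that pulls each sample toward the nearer of two hand-written futures, and trajectories are flattened into plain lists of floats. None of this reflects MolmoMotion-FM's actual architecture.

```python
import random
from typing import List, Optional, Sequence

Vec = List[float]  # a trajectory flattened to [x1, y1, z1, x2, y2, z2, ...]


def sq_dist(a: Vec, b: Vec) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def toy_velocity(x: Vec, t: float, modes: Sequence[Vec]) -> Vec:
    """Stand-in for a learned velocity field (an assumption, not the real model).

    Points along the straight line from x to the nearest mode, scaled so
    that the sample arrives at that mode when t reaches 1.
    """
    target = min(modes, key=lambda m: sq_dist(x, m))
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, target)]


def sample(modes: Sequence[Vec], steps: int = 20,
           rng: Optional[random.Random] = None) -> Vec:
    """Draw Gaussian noise, then Euler-integrate the velocity field over [0, 1)."""
    if rng is None:
        rng = random.Random()
    x = [rng.gauss(0.0, 1.0) for _ in modes[0]]
    dt = 1.0 / steps
    for k in range(steps):
        v = toy_velocity(x, k * dt, modes)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Two equally plausible futures: the object slides left or slides right.
    slide_left = [-1.0, 0.0, 0.0, -2.0, 0.0, 0.0]
    slide_right = [1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
    for seed in range(4):
        result = sample([slide_left, slide_right], rng=random.Random(seed))
        print(seed, [round(c, 3) for c in result])
```

Different noise draws settle on different futures, which is the property the article attributes to the flow-matching variant. An autoregressive decoder that emits one coordinate token at a time would need repeated stochastic decoding to expose the same spread.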
To benchmark performance, the team published PointMotionBench: 2,700 human-validated video clips across 111 object categories and 61 motion types. MolmoMotion outperformed all baselines tested, including pixel-space video generators, parametric 3D methods, and constant-velocity approaches. The release includes model weights, training data, the benchmark itself, and code.
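For context on the simplest baseline family, a constant-velocity forecaster extrapolates a point's last observed displacement. The sketch below pairs it with average displacement error, a common trajectory metric; whether PointMotionBench uses this exact metric is an assumption, and both function names are illustrative.

```python
from math import dist
from typing import List, Tuple

Point = Tuple[float, float, float]


def constant_velocity_forecast(history: List[Point], horizon: int) -> List[Point]:
    """Extrapolate the last observed displacement for `horizon` future steps."""
    if len(history) < 2:
        raise ValueError("need at least two observed points to estimate velocity")
    (x0, y0, z0), (x1, y1, z1) = history[-2], history[-1]
    vx, vy, vz = x1 - x0, y1 - y0, z1 - z0
    return [(x1 + vx * k, y1 + vy * k, z1 + vz * k) for k in range(1, horizon + 1)]


def average_displacement_error(pred: List[Point], truth: List[Point]) -> float:
    """Mean Euclidean distance between predicted and true points, step by step."""
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and the same length")
    return sum(dist(p, q) for p, q in zip(pred, truth)) / len(pred)


if __name__ == "__main__":
    observed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    actual = [(2.0, 0.0, 0.0), (3.0, 1.0, 0.0)]  # the point veers upward at step 2
    forecast = constant_velocity_forecast(observed, horizon=2)
    print(forecast, average_displacement_error(forecast, actual))
```

A baseline like this ignores the instruction entirely, so it cannot tell "push the cup" from "lift the cup"; that gap is what a language-conditioned forecaster is meant to close.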
The robotics transfer numbers are where this gets operationally interesting. After fine-tuning on DROID robot manipulation videos, MolmoMotion reached 76.3% success on pick-and-place simulation tasks. The Molmo 2 baseline without motion forecasting sat at 56.0%. A similar gap appeared during training: MolmoMotion hit 51% accuracy after 10,000 training steps, versus 19% for the baseline. On physical robots, MolmoMotion matched baseline error rates after roughly 2,000 training steps; the baseline needed 12,000. The sample-efficiency story is arguably as significant as the peak accuracy, because real robot deployments are constrained by how much costly physical demonstration data teams can collect.
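The gaps are easier to compare once reduced to a few ratios. The snippet below only restates the figures reported above; the helper names are illustrative.

```python
def absolute_gain(model: float, baseline: float) -> float:
    """Difference in percentage points."""
    return model - baseline


def relative_gain(model: float, baseline: float) -> float:
    """Improvement as a fraction of the baseline value."""
    return (model - baseline) / baseline


def step_ratio(baseline_steps: int, model_steps: int) -> float:
    """How many times more training steps the baseline needed."""
    return baseline_steps / model_steps


if __name__ == "__main__":
    # Simulation pick-and-place success after DROID fine-tuning (percent).
    print(f"absolute gain: {absolute_gain(76.3, 56.0):.1f} points")
    print(f"relative gain: {relative_gain(76.3, 56.0):.0%}")
    # Accuracy at 10,000 training steps (percent).
    print(f"early-training ratio: {51 / 19:.2f}x")
    # Steps to reach matching error on physical robots.
    print(f"step ratio: {step_ratio(12_000, 2_000):.0f}x")
```

In round terms, that is about 20 points of absolute gain in simulation (roughly a third better than the baseline) and a sixfold reduction in the training steps needed to reach the same error on physical robots.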
For video generation applications, MolmoMotion outperformed larger image-to-video models on four of five motion metrics. That result hints at an efficiency argument: a specialized motion forecaster running alongside a video generator may outperform a bigger monolithic model trained to do both.
The announced limitation is real and worth tracking. Training uses eight query points per object, which is enough for trajectory forecasting but insufficient for dense surface geometry. Complex deformable motion (cloth, liquids, articulated hands) remains outside the current design envelope.
Language-guided 3D motion forecasting is a structural requirement for any robot that needs to act on verbal commands in an unstructured environment. The dominant approach today is end-to-end imitation learning, which is brittle outside the training distribution. A factored approach, where a language model interprets the instruction and a motion model predicts the physical outcome, is more composable and more data-efficient. MolmoMotion is a public, benchmarked instantiation of that approach, with weights available.
Teams building manipulation pipelines on DROID-style data should benchmark MolmoMotion-FM against their current motion priors before the next data collection cycle. The sample-efficiency gap alone may justify the migration cost.
Source: Allen Institute for AI (Ai2), published June 17, 2026, via the Ai2 blog on Hugging Face.