NVIDIA's ENPIRE lets coding agents train robots without human resets

NVIDIA's GEAR lab closes the real-world reinforcement learning loop by automating scene resets and verification, removing the human bottleneck that has stalled autonomous robot training.

Alessandro Benigni

PUBLISHED JUN 22, 2026

The hardest part of teaching a robot to do something useful is not writing the algorithm. It is the ten thousand times a person has to walk over, pick up the object, put it back, and press go again.

On June 20, NVIDIA’s GEAR lab published ENPIRE, a harness framework that gives coding agents a self-resetting, self-verifying environment so they can run real-world robot learning experiments without a human in the loop. The name stands for Environment, Policy Improvement, Rollout, and Evolution, and each of the four modules corresponds to a stage in the training cycle the system now runs autonomously.

The reset problem is the real constraint in real-world reinforcement learning, and it is chronically underemphasized compared to algorithm design. Simulation sidesteps it entirely; a virtual arm can fail and recover in milliseconds. On physical hardware, every failed trial leaves cups knocked over, zip ties in the wrong position, circuit boards askew. A human has to fix the scene before the next trial starts. That labor cap means a physical robot can run far fewer experiments per day than its simulated counterpart, which compounds into slower policy improvement and higher iteration cost. ENPIRE’s Environment module attacks this directly: a combination of object detection, segmentation models, and scripted reset behaviors returns each task to a randomized initial state and then verifies the reset succeeded before the next rollout begins.
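The reset-then-verify gate described above can be sketched in a few lines. This is a minimal illustration, not ENPIRE's code: `reset_until_verified`, `FakeScene`, and the retry limit are hypothetical names and parameters, and the fake scene stands in for the detection, segmentation, and scripted reset behaviors the article mentions.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class ResetResult:
    ok: bool        # True if verification passed
    attempts: int   # number of reset attempts made


def reset_until_verified(
    reset: Callable[[], None],
    verify: Callable[[], bool],
    max_attempts: int = 3,  # hypothetical retry budget
) -> ResetResult:
    """Run the scripted reset, then verify it; retry until verified or out of attempts."""
    for attempt in range(1, max_attempts + 1):
        reset()
        if verify():
            return ResetResult(ok=True, attempts=attempt)
    return ResetResult(ok=False, attempts=max_attempts)


class FakeScene:
    """Stand-in for a physical scene: each reset succeeds with probability p_success."""

    def __init__(self, p_success: float, seed: int = 0) -> None:
        self.rng = random.Random(seed)
        self.p_success = p_success
        self.object_in_place = False

    def reset(self) -> None:
        # A real system would move objects back to a randomized start pose here.
        self.object_in_place = self.rng.random() < self.p_success

    def verify(self) -> bool:
        # A real system would check the scene with perception models here.
        return self.object_in_place


scene = FakeScene(p_success=0.8)
result = reset_until_verified(scene.reset, scene.verify)
if result.ok:
    print(f"scene verified after {result.attempts} attempt(s); start next rollout")
else:
    print("reset could not be verified; pause and flag for inspection")
```

The point of the gate is that a rollout never starts from an unverified scene. A failed reset stops the loop instead of silently corrupting the next trial.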

The Policy Improvement module hands control of the training loop to a coding agent. The agent reads rollout logs, consults literature, proposes a variant (heuristic learning, behavior cloning, offline RL, online RL, or a mix), edits the policy code, and sends the next experiment to the Rollout module. The Evolution module closes the loop by analyzing which ideas raised the team’s average success rate and surfacing cross-agent inspiration so one agent’s successful branch can inform another’s.
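One way to picture the Evolution step is as bookkeeping over experiment records: score each idea by how much it moved the success rate, then suggest the winners to agents that have not tried them. The sketch below assumes that framing. The `Experiment` record, `rank_ideas`, and `cross_agent_suggestions` are hypothetical, and the numbers are invented for illustration rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Experiment:
    agent: str       # which coding agent ran it
    idea: str        # e.g. "behavior cloning", "online RL"
    baseline: float  # success rate before the change
    result: float    # success rate after the change


def rank_ideas(experiments: List[Experiment]) -> List[Tuple[str, float]]:
    """Return (idea, mean lift in success rate), best first."""
    lifts: Dict[str, List[float]] = defaultdict(list)
    for e in experiments:
        lifts[e.idea].append(e.result - e.baseline)
    means = {idea: sum(v) / len(v) for idea, v in lifts.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


def cross_agent_suggestions(experiments: List[Experiment]) -> Dict[str, List[str]]:
    """For each agent, list ideas with positive mean lift that it has not tried yet."""
    ranked = rank_ideas(experiments)
    tried: Dict[str, set] = defaultdict(set)
    for e in experiments:
        tried[e.agent].add(e.idea)
    return {
        agent: [idea for idea, lift in ranked if lift > 0 and idea not in ideas]
        for agent, ideas in tried.items()
    }


# Invented example data.
history = [
    Experiment("agent_a", "behavior cloning", 0.5, 0.75),
    Experiment("agent_a", "online RL", 0.5, 0.5),
    Experiment("agent_b", "offline RL", 0.25, 0.75),
    Experiment("agent_b", "online RL", 0.5, 0.25),
]
print(rank_ideas(history))
print(cross_agent_suggestions(history))
```

A real system would also need to account for noise in success-rate estimates. A mean lift over a handful of physical trials is weak evidence on its own.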

NVIDIA tested three coding agents against the framework: Codex running GPT-5.5, Claude Code running Opus 4.7, and Kimi Code running Kimi K2.6. On a PushT task and a pin insertion task, the agents drove policies to a 99% pass-at-8 success rate on physical hardware. A parallel fleet configuration adds two efficiency metrics, Mean Robot Utilization and Mean Token Utilization, to track how well the system keeps both physical and compute resources occupied.
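The article does not define pass-at-8. One common reading, borrowed from code-generation benchmarks, is the probability that at least one of k sampled attempts succeeds, estimated from n recorded attempts of which c succeeded. The sketch below assumes that definition, which may not match the paper's.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one success in k attempts) from c successes in n attempts.

    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k).
    Assumption: this is the code-generation convention, not confirmed for ENPIRE.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving an estimate of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 16 recorded attempts, 6 successes, scored at k = 8.
print(pass_at_k(n=16, c=6, k=8))
```

Under this reading, a high pass-at-8 can coexist with a much lower single-attempt success rate, so the two numbers should not be compared directly.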

The research framing requires a direct caveat: ENPIRE is a lab framework, not a product. The tasks demonstrated (PushT, pin box assembly, zip tie cutting, GPU slot insertion) are controlled manipulation challenges in a research setting. Each task required careful instrumentation of the reset and verification modules before the autonomous loop could run. That instrumentation is non-trivial engineering, and the paper does not claim the system generalizes to arbitrary manipulation scenarios out of the box. How well ENPIRE transfers to tasks with less predictable physical dynamics, or to robots beyond the specific hardware used at CMU and UC Berkeley, remains an open question.

What the work does establish is that the closed-loop physical training cycle is tractable at small scale. Prior work showed coding agents could automate algorithm search in simulation; ENPIRE shows the same pattern can be instantiated on real hardware when the environment scaffolding is in place. The bottleneck shifts from “humans resetting scenes” to “engineers instrumenting reset modules,” which is at least a one-time cost per task rather than a per-trial cost.

For teams building robot learning pipelines, the two metrics NVIDIA introduces are worth tracking independently of the framework itself. Mean Robot Utilization measures the fraction of time physical robots are actually running experiments rather than waiting. Mean Token Utilization measures token efficiency across the agent team. Both expose waste that current real-world RL pipelines rarely quantify. Any lab designing a multi-robot training setup in the next year should build those measurements into their infrastructure from the start.
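For a lab that wants to start measuring this, robot utilization as described above reduces to simple arithmetic over per-robot time logs. The sketch below is one possible formulation, not NVIDIA's definition: `RobotLog` and its fields are hypothetical, and whether reset time counts as "busy" is a choice each team has to make. Token utilization is left out because the article does not say how it is computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RobotLog:
    name: str
    busy_seconds: float   # time spent running experiments
    total_seconds: float  # length of the observation window


def mean_robot_utilization(logs: List[RobotLog]) -> float:
    """Average, across robots, of the fraction of the window spent running experiments."""
    if not logs:
        raise ValueError("need at least one robot log")
    fractions = []
    for log in logs:
        if log.total_seconds <= 0:
            raise ValueError(f"{log.name}: total_seconds must be positive")
        fractions.append(log.busy_seconds / log.total_seconds)
    return sum(fractions) / len(fractions)


# Invented one-hour window for a two-robot fleet.
fleet = [
    RobotLog("arm_1", busy_seconds=1800, total_seconds=3600),
    RobotLog("arm_2", busy_seconds=2700, total_seconds=3600),
]
print(f"mean robot utilization: {mean_robot_utilization(fleet):.3f}")
```

Averaging per-robot fractions weights every robot equally. A fleet with uneven observation windows might prefer total busy time divided by total available time.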

Source: NVIDIA Research (GEAR lab), published June 20, 2026, at research.nvidia.com/labs/gear/enpire/.
