NVIDIA has published ZPPO (Zone of Proximal Policy Optimization), a reinforcement-learning post-training method designed to close the gap that standard RL leaves open: hard questions that a model fails consistently get discarded, and the model never learns from them.

The problem is structural. In GRPO-style RL training, a model generates many rollouts for each question, and each rollout is rewarded or penalized relative to how the rest of its group performed. When a question is so difficult that rollout accuracy sits near zero, nearly every rollout receives the same reward, the group-relative advantages collapse toward zero, and no learning signal gets through. The question gets dropped from training and the model’s weakness on that class of problems persists. NVIDIA’s project page, authored by Byung-Kwan Lee, frames this as the central failure mode motivating ZPPO.
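A minimal sketch makes the vanishing signal concrete. It assumes the commonly described GRPO-style normalization, in which each rollout's reward is centered on the group mean and divided by the group standard deviation. The function name, the `eps` constant, and the group size of eight are illustrative choices, not details taken from NVIDIA's project page.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group.

    Assumption: a GRPO-style advantage of (reward - group mean) divided by
    (group std + eps). `eps` only guards against division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical hard question: all eight rollouts fail, so every reward is 0.
all_fail = group_relative_advantages([0.0] * 8)

# Hypothetical slightly easier question: one rollout out of eight succeeds.
mixed = group_relative_advantages([1.0] + [0.0] * 7)

print(all_fail)  # every advantage is zero, so the policy gradient vanishes
print(mixed)     # the lone success gets a positive advantage, the rest negative
```

When every rollout fails, each reward equals the group mean, so every advantage is exactly zero and the question contributes nothing to the update. A single success is enough to restore a usable signal.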

The fix is a replay buffer. ZPPO admits a question to the buffer when its rollout accuracy falls below 50 percent. The model then sees that question repeatedly across training batches rather than once. Each revisit strengthens what the ZPPO paper calls the BCQ and NCQ effects on hard questions, gradually pushing rollout accuracy upward. A question exits the buffer only once its accuracy reaches 50 percent, at which point it has graduated from the hard-question category.
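The admission and graduation rule can be sketched as a small data structure. This is an illustration under stated assumptions, not NVIDIA's implementation: the class name `HardQuestionBuffer`, its methods, and the use of the most recent rollout group as the accuracy estimate are all hypothetical. Only the 50 percent threshold comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass
class HardQuestionBuffer:
    """Hypothetical replay buffer keyed by question id.

    A question is held while its latest rollout accuracy is below
    `threshold` and released once accuracy reaches the threshold.
    """
    threshold: float = 0.5
    items: dict = field(default_factory=dict)  # question_id -> latest accuracy

    def update(self, question_id, rollout_rewards):
        """Record a rollout group (1 = success, 0 = failure); return its accuracy."""
        accuracy = sum(rollout_rewards) / len(rollout_rewards)
        if accuracy < self.threshold:
            self.items[question_id] = accuracy   # admit, or keep for another revisit
        else:
            self.items.pop(question_id, None)    # graduate: no longer a hard question
        return accuracy

    def replay_ids(self):
        """Question ids that should be mixed into upcoming batches."""
        return list(self.items)


buf = HardQuestionBuffer()
buf.update("q1", [0, 0, 0, 1])   # accuracy 0.25: admitted
held = buf.replay_ids()
buf.update("q1", [1, 1, 0, 0])   # accuracy 0.50: graduates
released = buf.replay_ids()
```

How buffered questions are sampled into later batches, and how many revisits a question receives, are left open here because the description above does not specify them.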

That graduation dynamic is where ZPPO separates from GRPO patched with teacher injection, the method most teams reach for when standard RL stalls on hard examples. Injecting a teacher model’s response directly into the student’s training stream breaks the on-policy assumption. The student’s gradient then incorporates outputs the student did not produce, which distorts the policy distribution and degrades generalization on questions outside the training set. ZPPO avoids teacher injection entirely; the student works out hard questions on its own, given enough repeated exposure.

The benchmark results, run on Qwen3.5 with a 27B teacher, show the advantage across three modalities. On 10 LLM benchmarks, ZPPO delivers a plus-7.9 accuracy-point gain versus a baseline. GRPO alone reaches plus-3.5 on those same benchmarks, and GRPO augmented with teacher responses reaches only plus-1.9 (the on-policy violation costs it the gains from the teacher knowledge). On 16 VLM benchmarks, ZPPO posts plus-9.3 points; GRPO alone reaches plus-4.4. On 5 video benchmarks, ZPPO reaches plus-4.5; GRPO alone reaches plus-2.2. Distillation methods, both off-policy and on-policy, land negative on LLM and video benchmarks, a result consistent with the known overfitting failure mode of imitation-based post-training.

The numbers come from NVIDIA’s own project page and have not been reproduced by independent evaluation at the time of publication.

The framing here matters for teams building reasoning-capable models. The dominant post-training stack as of mid-2026 involves GRPO or similar RL methods for reasoning elicitation, often combined with distillation from a larger teacher. ZPPO’s findings suggest that both components of that hybrid have underappreciated failure modes: on these results, distillation degrades generalization, and teacher injection into RL breaks on-policy training. A replay buffer that lets the student work through hard cases without imitation pressure is a simpler intervention, and the accuracy numbers on video benchmarks, a modality notorious for sparse reward signals, suggest the approach transfers beyond text.

The gap ZPPO closes is widest, NVIDIA’s public project page notes, in the regime where initial rollout accuracy starts near zero, which is precisely where standard RL training on hard questions currently yields no learning signal.

Teams running RL post-training pipelines on multimodal models should evaluate the ZPPO replay-buffer design as a concrete alternative to teacher injection before committing to a distillation-plus-GRPO hybrid for their next training run.

Source: NVIDIA project page authored by Byung-Kwan Lee, published June 19, 2026.