NVIDIA's LocateAnything decodes vision-language bounding boxes in parallel

NVIDIA Research released LocateAnything on May 28, a vision-language grounding framework that predicts bounding box coordinates in parallel rather than autoregressively. The architecture change addresses a latency bottleneck that has limited the deployment of grounding models in real-time applications.

Most existing vision-language grounding models, including the strongest open-source ones, decode bounding box coordinates as a sequence of tokens: the model produces “x1”, then “y1”, then “x2”, then “y2”, with each prediction conditioned on the previous one. That autoregressive decoding works correctly but takes four sequential decoding steps per box, so the cost grows linearly with the number of objects in a frame. LocateAnything sidesteps this by predicting all four coordinates simultaneously through a parallel decoder head that operates on the model’s final vision-language representation.
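To make the step-count argument concrete, here is a minimal Python sketch that counts sequential decoder passes under both schemes. It is an illustration, not LocateAnything's code: the function names are hypothetical, it assumes one token per coordinate with no separator or label tokens, and it assumes the parallel head still handles one box per pass (the release may also decode boxes in parallel with each other, which this brief does not cover).

```python
def autoregressive_steps(num_boxes: int, coords_per_box: int = 4) -> int:
    """Sequential decoder passes when every coordinate is its own token.

    Assumption: one token per coordinate, no separator or label tokens.
    """
    return num_boxes * coords_per_box


def parallel_head_steps(num_boxes: int) -> int:
    """Sequential passes when a single pass emits all coordinates of a box.

    Assumption: one pass per box; boxes are still handled one after another.
    """
    return num_boxes


if __name__ == "__main__":
    for n in (1, 10, 50):
        print(f"boxes={n} autoregressive={autoregressive_steps(n)} "
              f"parallel={parallel_head_steps(n)}")
```

Under these assumptions the ratio between the two counts stays at four regardless of scene density; what changes with the number of objects is the absolute number of passes saved.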

The framework targets applications where grounding latency matters: real-time video understanding, robotic perception, AR overlays, and autonomous driving perception stacks. For batch use cases (post-hoc image annotation, training data labeling), the latency win is less significant. For streaming use cases, it can be the difference between an architecture that works at 30 fps and one that does not.
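The 30 fps threshold translates into a per-frame time budget of roughly 33 ms. The Python sketch below works through that arithmetic. The per-step latency (2 ms) and the object count (10) are made-up illustrative values, not measurements of LocateAnything or any other model, and the helper names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds."""
    return 1000.0 / fps


def decode_time_ms(num_boxes: int, steps_per_box: int, ms_per_step: float) -> float:
    """Total decoding time if every step runs sequentially."""
    return num_boxes * steps_per_box * ms_per_step


def fits_budget(decode_ms: float, fps: float, other_ms: float = 0.0) -> bool:
    """True if decoding plus all other per-frame work fits in one frame."""
    return decode_ms + other_ms <= frame_budget_ms(fps)


if __name__ == "__main__":
    MS_PER_STEP = 2.0  # hypothetical latency of one decoder pass
    NUM_BOXES = 10     # hypothetical objects per frame

    sequential = decode_time_ms(NUM_BOXES, 4, MS_PER_STEP)  # four passes per box
    parallel = decode_time_ms(NUM_BOXES, 1, MS_PER_STEP)    # one pass per box

    print(f"budget at 30 fps: {frame_budget_ms(30):.1f} ms")
    print(f"sequential: {sequential:.1f} ms, fits: {fits_budget(sequential, 30)}")
    print(f"parallel:   {parallel:.1f} ms, fits: {fits_budget(parallel, 30)}")
```

The `other_ms` parameter is there because decoding is rarely the only per-frame cost: image encoding and pre- and post-processing draw on the same budget.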

One caveat is that NVIDIA Research releases are typically reference implementations rather than production-ready libraries. The model weights, code, and benchmark numbers are likely published on the project page, but the integration work to bring the model into an existing perception pipeline (PyTorch versions, CUDA dependencies, license terms for commercial use) is up to the adopting team.

For computer vision and robotics teams running visual grounding in latency-sensitive contexts, LocateAnything is worth a focused evaluation against your current grounding pipeline. Parallel decoding lowers end-to-end inference cost only if your workload is actually bottlenecked at the decoder step. Profile first.
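One way to act on that advice is to time each pipeline stage, compute the share of wall-clock time spent in decoding, and bound the achievable gain with Amdahl's law. The Python sketch below does this with stand-in stages. The stage names and stub workloads are hypothetical placeholders for a real pipeline, and nothing here is part of the LocateAnything release.

```python
import time
from typing import Callable, Dict


def time_stages(stages: Dict[str, Callable[[], object]], runs: int = 20) -> Dict[str, float]:
    """Mean wall-clock seconds per stage over `runs` calls.

    Note: for GPU stages, synchronize the device inside each callable
    (for example with torch.cuda.synchronize()) or the timings will
    only measure kernel launch overhead.
    """
    totals = {name: 0.0 for name in stages}
    for _ in range(runs):
        for name, fn in stages.items():
            start = time.perf_counter()
            fn()
            totals[name] += time.perf_counter() - start
    return {name: total / runs for name, total in totals.items()}


def stage_share(timings: Dict[str, float], stage: str) -> float:
    """Fraction of total time spent in `stage` (0.0 if nothing was timed)."""
    total = sum(timings.values())
    return timings[stage] / total if total > 0 else 0.0


def overall_speedup(share: float, stage_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when one stage gets faster."""
    return 1.0 / ((1.0 - share) + share / stage_speedup)


if __name__ == "__main__":
    # Hypothetical stand-ins for real pipeline stages.
    stages = {
        "encode": lambda: sum(i * i for i in range(20_000)),
        "decode": lambda: sum(i * i for i in range(5_000)),
    }
    timings = time_stages(stages)
    share = stage_share(timings, "decode")
    print(f"decode share: {share:.0%}")
    print(f"best case with a 4x faster decoder: {overall_speedup(share, 4.0):.2f}x")
```

If decoding accounts for only a small share of frame time, even a 4x faster decoder barely moves the end-to-end number, which is the case the profiling step is meant to catch.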

Published by NVIDIA Research on 2026-05-28.
