NVIDIA Claims MLPerf Training 6.0 Sweep on Blackwell

NVIDIA submitted the fastest training times across all seven benchmarks in MLPerf Training 6.0, including an 8,192-GPU run on DeepSeek-V3, per results the company published June 16.

Alessandro Benigni

PUBLISHED JUN 18, 2026

NVIDIA’s Blackwell platform posted the fastest training times across every benchmark in MLPerf Training 6.0, the industry’s peer-reviewed suite for comparing AI training hardware. The results, announced by NVIDIA on June 16, came from vendor-submitted runs, meaning the figures reflect what NVIDIA and its partners chose to submit rather than neutral third-party testing.

MLPerf results are peer-reviewed, but submissions are self-selected: vendors decide which configurations to enter and when. NVIDIA was the only participant to submit results across all seven benchmarks this round, which matters for interpreting the sweep claim. A competitor absent from a category is not a losing competitor.

This round, NVIDIA submitted results for both the GB200 NVL72 and the newer GB300 NVL72 rack-scale systems. The company says the GB300 delivered up to 1.6x faster training than the GB200 at equivalent scale, driven by higher NVFP4 compute density, expanded memory, and a higher sustained power ceiling. NVFP4 is a 4-bit floating-point format that reduces memory bandwidth pressure during large-scale pretraining; NVIDIA used it to train its own 550-billion-parameter Nemotron 3 Ultra model. Lower-precision training typically risks some loss of accuracy, and NVIDIA states its methods met “strict accuracy requirements,” though independent verification of those accuracy thresholds is not part of the MLPerf submission process.
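To make the precision tradeoff concrete, here is a minimal sketch of block-scaled 4-bit quantization. It assumes the E2M1 value grid (magnitudes 0 to 6) and keeps each block's scale as an ordinary Python float; production NVFP4 stores its scales in low-precision formats and runs on dedicated hardware, none of which is modeled here. The function names are hypothetical, not part of any NVIDIA library.

```python
# Toy illustration of block-scaled 4-bit float quantization.
# Assumption: values snap to the E2M1 magnitude grid below, and each block
# shares one scale held as a plain Python float (a simplification).

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(values):
    """Return (scale, codes) where each code is a signed E2M1 grid value."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 1.0, [0.0] * len(values)
    # Map the largest magnitude in the block onto the top of the grid.
    scale = max_abs / E2M1_GRID[-1]
    codes = []
    for v in values:
        target = abs(v) / scale
        nearest = min(E2M1_GRID, key=lambda g: abs(g - target))
        codes.append(nearest if v >= 0 else -nearest)
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the shared scale and 4-bit codes."""
    return [scale * c for c in codes]


if __name__ == "__main__":
    scale, codes = quantize_block([1.0, 0.1])
    # The small value lands on a coarse grid point, so it comes back inexact.
    print(dequantize_block(scale, codes))
```

Values far below the block maximum fall on widely spaced grid points, which is where the accuracy risk comes from and why the choice of block size and scale format matters.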

The largest run in the suite placed 8,192 GB200 NVL72 GPUs on the DeepSeek-V3 671B workload, which NVIDIA describes as the biggest Blackwell cluster ever submitted to MLPerf Training. CoreWeave’s GB300 NVL72 configuration hit the DeepSeek-V3 quality target in 2.02 minutes at that scale. Microsoft Azure reached the Llama 3.1 405B quality target in 7.07 minutes on a matching GPU count.

The core engineering story is NVLink and its role in mixture-of-experts (MoE) routing. MoE models like DeepSeek-V3 route each token to a small subset of expert subnetworks, requiring frequent all-to-all GPU communication. Within each NVL72 rack, fifth-generation NVLink Switches connect all 72 GPUs into a shared compute-and-memory pool, which NVIDIA says reduces the communication bottleneck that typically limits MoE throughput at scale. The competitive question is whether AMD’s MI300X clusters or custom silicon from Google (TPU v5) and Amazon (Trainium 2) can match that routing efficiency. Neither AMD nor the hyperscaler custom-silicon programs submitted results in this MLPerf round, so a direct comparison does not exist in this dataset.
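The all-to-all pattern can be shown with a small routing sketch. This is an illustration under stated assumptions, not DeepSeek-V3's actual router: it uses softmax top-k gating and a hypothetical layout in which experts are placed contiguously, so expert e lives on GPU e // experts_per_gpu. All function names are invented for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k):
    """Pick the top_k experts for one token; weights are renormalized to sum to 1."""
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def traffic_matrix(token_home_gpu, token_logits, top_k, experts_per_gpu, num_gpus):
    """Count token copies sent from each source GPU to each destination GPU.

    Assumes expert e is hosted on GPU e // experts_per_gpu (hypothetical layout).
    """
    traffic = [[0] * num_gpus for _ in range(num_gpus)]
    for src, logits in zip(token_home_gpu, token_logits):
        for expert, _weight in route_token(logits, top_k):
            dst = expert // experts_per_gpu
            traffic[src][dst] += 1
    return traffic


if __name__ == "__main__":
    homes = [0, 1]                       # token 0 sits on GPU 0, token 1 on GPU 1
    logits = [[5.0, 0.0, 4.0, 0.0],      # prefers experts 0 and 2
              [0.0, 3.0, 0.0, 2.0]]      # prefers experts 1 and 3
    print(traffic_matrix(homes, logits, top_k=2, experts_per_gpu=2, num_gpus=2))
```

Every off-diagonal entry in the matrix is traffic that must cross the interconnect at each MoE layer, which is why interconnect bandwidth between GPUs weighs so heavily on MoE throughput.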

On reliability, NVIDIA highlighted two features built for long training runs. The RAS Engine (Reliability, Availability and Serviceability) monitors the chip continuously and routes around detected faults without stopping a job. The NVIDIA Resiliency Extension (NVRx) handles fault recovery at the cluster level: when a node goes down, the system restores from a recent checkpoint rather than restarting the full job from scratch. Both are production features that matter at the scale of weeks-long training runs across tens of thousands of GPUs, where a single node failure on a naive setup can waste days of compute.
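The value of checkpoint-based recovery is easy to see in a toy simulation. The sketch below is not the NVRx API; it is a hypothetical step counter that compares rolling back to the last checkpoint against restarting from step zero, with made-up parameter names.

```python
def steps_executed(total_steps, checkpoint_every, failing_steps):
    """Simulate a training run and return total steps executed, including redone work.

    Each step listed in failing_steps fails once; on failure the run rolls back
    to the most recent checkpoint (step 0 if none has been written yet).
    """
    failures = set(failing_steps)
    last_checkpoint = 0
    step = 0
    executed = 0
    while step < total_steps:
        step += 1
        executed += 1
        if step in failures:
            failures.discard(step)   # the fault is assumed transient
            step = last_checkpoint   # resume from the last saved state
            continue
        if step % checkpoint_every == 0:
            last_checkpoint = step
    return executed


if __name__ == "__main__":
    # One failure at step 57 of a 100-step run.
    with_checkpoints = steps_executed(100, checkpoint_every=10, failing_steps=[57])
    # A checkpoint interval longer than the run means every failure restarts from scratch.
    from_scratch = steps_executed(100, checkpoint_every=1000, failing_steps=[57])
    print(with_checkpoints, from_scratch)
```

In this toy run the checkpointed job redoes 7 steps while the restart-from-scratch job redoes 57; the gap widens as runs get longer and failures more frequent.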

Ecosystem participation was broad. Nineteen organizations submitted results on Blackwell hardware, including Cohere (3x faster training on GB200 NVL72 for its North agentic AI platform, per NVIDIA’s own case study), Midjourney (Blackwell-trained v8 image model), and Thinking Machines Lab (2x faster training on GB300 NVL72 via Google Cloud).

The 8,192-GPU scale matters for training economics, not just benchmark positioning. At that cluster size, a roughly 2-minute run to the quality target on a 671-billion-parameter MoE model represents a practical compression of iteration cycles from days to hours. Teams planning frontier training runs in the next six to twelve months should treat the GB300 NVL72 throughput data as directional rather than definitive until the results are independently replicated or AMD and Google submit comparable configurations.

Based on a post published June 16, 2026 on the NVIDIA blog, authored by Shruti Koparkar, reporting NVIDIA’s own submitted results for MLPerf Training 6.0.
