Inside the NVIDIA Vera Rubin Platform
Abstract
NVIDIA's Vera Rubin platform represents the next generation of AI compute infrastructure, treating the data center — not a single GPU — as the unit of compute. The platform comprises six co-designed chips (seven with the later addition of Groq 3 LPX) engineered for integrated operation at rack scale.
Key Contributions
- Rubin GPU delivers 50 PFLOPS NVFP4 inference and 35 PFLOPS training — a 5x and 3.5x improvement over Blackwell respectively
- First architecture to use HBM4 memory: 288GB per GPU at 22 TB/s bandwidth (2.8x over Blackwell's 8 TB/s)
- NVLink 6 provides 3.6 TB/s bidirectional bandwidth per GPU (2x Blackwell), with 260 TB/s aggregate in an NVL72 rack, more than the estimated bandwidth of the entire global internet
- Vera CPU with 88 custom Olympus cores (Arm), 1.5TB LPDDR5X at 1.2 TB/s, coherent CPU-GPU link at 1.8 TB/s
- 336 billion transistors per Rubin GPU (up from 208B on Blackwell)
- Rack-scale confidential computing with third-generation trusted execution
Architecture Details
Six Core Chips
- Vera CPU — 88 custom Olympus cores, 176 threads via Spatial Multithreading, 162MB unified L3 cache, PCIe Gen6 with CXL 3.1
- Rubin GPU — 224 SMs, fifth-gen Tensor Cores optimized for NVFP4/FP8, expanded special function units for attention/activation/sparse compute
- NVLink 6 Switch — 36 switches per NVL72 rack, in-network SHARP FP8 acceleration (14.4 TFLOPS per tray), hot-swappable trays
- ConnectX-9 — 800 Gb/s per port, 1.6 Tb/s quad SuperNIC per tray, 800 Gb/s inline cryptography
- BlueField-4 DPU — 64-core Grace CPU, 800 Gb/s networking, 20M IOPS NVMe storage, ASTRA trust architecture
- Spectrum-6 Ethernet — 102.4 Tb/s per switch, co-packaged silicon photonics (64x signal integrity improvement)
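The value of SHARP in the NVLink switch is that reductions happen in the network fabric instead of on the GPUs. A minimal sketch of the step-count difference (the ring-allreduce baseline and step model are standard collective-communication analysis, not figures from the article):

```python
# Sketch (not NVIDIA's implementation): latency-bound step counts for an
# allreduce across N GPUs, classic ring vs. in-network (SHARP-style) reduction.

def ring_allreduce_steps(n: int) -> int:
    # Classic ring allreduce: (n-1) reduce-scatter steps + (n-1) all-gather steps,
    # with the reduction math running on the GPUs themselves.
    return 2 * (n - 1)

def sharp_allreduce_steps() -> int:
    # In-network reduction: each GPU sends its data into the switch tree once
    # and receives the reduced result once, regardless of GPU count; the
    # reduction math runs in the switch (14.4 TFLOPS FP8 per tray).
    return 2

n_gpus = 72  # one NVL72 rack
print(ring_allreduce_steps(n_gpus))  # 142 steps
print(sharp_allreduce_steps())       # 2 steps
```

The constant step count is why in-network reduction matters most for latency-sensitive collectives such as the frequent small allreduces in training.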
Performance vs. Blackwell
| Metric | Blackwell | Rubin | Improvement |
|---|---|---|---|
| NVFP4 Inference | 10 PFLOPS | 50 PFLOPS | 5x |
| NVFP4 Training | 10 PFLOPS | 35 PFLOPS | 3.5x |
| HBM Bandwidth | 8 TB/s | 22 TB/s | 2.8x |
| NVLink per GPU | 1.8 TB/s | 3.6 TB/s | 2x |
| Transistors | 208B | 336B | 1.6x |
| HBM Capacity | 192 GB | 288 GB | 1.5x |
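The NVFP4 numbers above refer to a block-scaled 4-bit floating-point format. A hedged illustration of how 4-bit E2M1 quantization with a per-block scale works (the E2M1 value set is the standard FP4 encoding; treating the scale as a separate per-block factor follows NVIDIA's published block-scaling scheme, but the details here are an assumption, not taken from this article):

```python
# Illustrative 4-bit E2M1 quantization with a per-block scale factor,
# in the spirit of NVFP4. Block size and scale handling are assumptions.

E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive FP4 (E2M1) magnitudes

def quantize_block(block):
    """Scale a block so its max magnitude maps to 6.0 (the largest E2M1
    value), then round each element to the nearest representable value.
    Returns the dequantized values and the scale stored alongside them."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # stored separately per block (e.g. as an FP8 factor)
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(nearest * scale * (1 if x >= 0 else -1))
    return out, scale

deq, s = quantize_block([0.9, -2.4, 0.1, 6.0])
print(deq)  # [1.0, -2.0, 0.0, 6.0] -- coarse values, recovered via the scale
```

The per-block scale is what keeps a 4-bit format usable: outliers only distort the block they live in, not the whole tensor.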
NVL72 Rack Specs
- 72 Rubin GPUs with all-to-all NVLink topology
- 260 TB/s aggregate scale-up bandwidth
- Per tray: 200 PFLOPS, 14.4 TB/s NVLink, 2TB fast memory
- Rack power: 180-220 kW (fully liquid-cooled)
- Cableless modular trays using Paladin HD2 connectors (assembly: 5 min vs 2 hours)
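The rack-level figures follow directly from the per-GPU specs. A back-of-envelope check (the 4-GPUs-per-tray layout is an assumption consistent with the per-tray numbers above):

```python
# Sanity-check the NVL72 rack numbers from the per-GPU specs in this article.

GPUS_PER_RACK = 72
GPUS_PER_TRAY = 4          # assumption: 18 compute trays of 4 GPUs each
PFLOPS_PER_GPU = 50        # NVFP4 inference
NVLINK_TBPS_PER_GPU = 3.6  # bidirectional

print(GPUS_PER_TRAY * PFLOPS_PER_GPU)           # 200 PFLOPS per tray
print(GPUS_PER_TRAY * NVLINK_TBPS_PER_GPU)      # 14.4 TB/s NVLink per tray
print(round(GPUS_PER_RACK * NVLINK_TBPS_PER_GPU, 1))  # 259.2, i.e. the ~260 TB/s aggregate
print(GPUS_PER_RACK * PFLOPS_PER_GPU / 1000)    # 3.6 EFLOPS NVFP4 per rack
```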
Target Workloads
- Long-context inference (100K+ tokens)
- Mixture-of-Experts models with dynamic routing
- Agentic reasoning pipelines
- Continuous training/post-training
- Multi-tenant, multi-model execution
- MoE inference: claimed 10x lower cost per token vs Blackwell
- MoE training: claimed 4x reduction in GPU count
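Why the HBM4 bandwidth matters for these workloads: at small batch sizes, LLM decode must stream the active weights from memory once per token, so bandwidth caps token rate. A rough roofline-style estimate (the model sizes and byte counts below are illustrative assumptions, not figures from the article):

```python
# Rough bandwidth-bound ceiling on batch-1 decode throughput per GPU.
# Assumes weights are re-read from HBM for every generated token.

def max_tokens_per_s(active_params_b: float, bytes_per_param: float, hbm_tbps: float) -> float:
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return hbm_tbps * 1e12 / bytes_per_token

# Hypothetical 70B-active-parameter model with FP4 weights (0.5 bytes/param)
print(round(max_tokens_per_s(70, 0.5, 22)))  # ~629 tokens/s ceiling at 22 TB/s (Rubin)
print(round(max_tokens_per_s(70, 0.5, 8)))   # ~229 tokens/s ceiling at 8 TB/s (Blackwell)
```

MoE models compound this: routing activates only a fraction of the parameters per token, shrinking `active_params_b` and raising the same ceiling further.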
Deployment Timeline
- CES 2026: Architecture announced, full production confirmed
- H2 2026: Systems shipping to customers
- March 2026 update: Vera Rubin POD announced with seventh chip
Limitations
- Extreme power density (180-220 kW per rack) requires purpose-built liquid-cooling infrastructure
- Co-packaged silicon photonics for Spectrum-6 is cutting-edge and may face yield challenges at scale
- Premium pricing — the "extreme co-design" strategy deepens vendor lock-in vs. open standards
Source: Inside the NVIDIA Vera Rubin Platform by Kyle Aubrey, NVIDIA
Tags
gpu-architecture, ai-compute, nvidia, hbm4, nvlink, data-center