Inside the NVIDIA Vera Rubin Platform

Tech Report
Kyle Aubrey, NVIDIA, January 5, 2026
Key Contribution

Six-chip co-designed AI supercomputer platform: 50 PFLOPS FP4 inference, 288GB HBM4 at 22TB/s, 5x improvement over Blackwell

Abstract

NVIDIA's Vera Rubin platform represents the next generation of AI compute infrastructure, treating the data center — not a single GPU — as the unit of compute. The platform comprises six co-designed chips (seven with the later addition of Groq 3 LPX) engineered for integrated operation at rack scale.

Key Contributions

  • Rubin GPU delivers 50 PFLOPS NVFP4 inference and 35 PFLOPS training — a 5x and 3.5x improvement over Blackwell respectively
  • First architecture to use HBM4 memory: 288GB per GPU at 22 TB/s bandwidth (2.8x over Blackwell's 8 TB/s)
  • NVLink 6 provides 3.6 TB/s bidirectional bandwidth per GPU (2x Blackwell), with 260 TB/s aggregate in an NVL72 rack — more than the estimated aggregate bandwidth of the entire global internet
  • Vera CPU with 88 custom Olympus cores (Arm), 1.5TB LPDDR5X at 1.2 TB/s, coherent CPU-GPU link at 1.8 TB/s
  • 336 billion transistors per Rubin GPU (up from 208B on Blackwell)
  • Rack-scale confidential computing with third-generation trusted execution
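
NVFP4 is a block-scaled 4-bit floating-point format. As an illustration of how such formats trade precision for density, here is a minimal sketch of block-scaled 4-bit quantization. The details here are assumptions, not from the report: E2M1 element values and a block size of 16 with one shared scale, mirroring publicly documented microscaling (MX-style) formats; NVIDIA's exact NVFP4 encoding may differ.

```python
# Representable magnitudes of an E2M1 (2 exponent bits, 1 mantissa bit) value.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize a block of floats to signed E2M1 codes plus one shared scale."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0  # map the largest magnitude to the top of the grid
    def nearest(v):
        mag = min(E2M1_GRID, key=lambda g: abs(g - abs(v)))
        return mag if v >= 0 else -mag
    return scale, [nearest(x / scale) for x in block]

def dequantize_block(scale, codes):
    return [scale * c for c in codes]

# A 16-element block: each element now costs 4 bits plus 1/16 of one scale.
scale, codes = quantize_block([0.1, -0.4, 2.5, 0.05] * 4)
recovered = dequantize_block(scale, codes)
```

The shared per-block scale is what keeps 4-bit elements usable: outliers set the scale, and everything else is snapped to the nearest grid point relative to it.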

Architecture Details

Six Core Chips

  1. Vera CPU — 88 custom Olympus cores, 176 threads via Spatial Multithreading, 162MB unified L3 cache, PCIe Gen6 with CXL 3.1
  2. Rubin GPU — 224 SMs, fifth-gen Tensor Cores optimized for NVFP4/FP8, expanded special function units for attention/activation/sparse compute
  3. NVLink 6 Switch — 36 switches per NVL72 rack, in-network SHARP FP8 acceleration (14.4 TFLOPS per tray), hot-swappable trays
  4. ConnectX-9 — 800 Gb/s per port, 1.6 Tb/s quad SuperNIC per tray, 800 Gb/s inline cryptography
  5. BlueField-4 DPU — 64-core Grace CPU, 800 Gb/s networking, 20M IOPS NVMe storage, ASTRA trust architecture
  6. Spectrum-6 Ethernet — 102.4 Tb/s per switch, co-packaged silicon photonics (64x signal integrity improvement)
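
The SHARP acceleration in the NVLink 6 switches moves reductions off the GPUs and into the network: the switch sums partial results and multicasts them back, so each GPU sends and receives its data once instead of participating in a multi-step exchange. The sketch below is an illustrative model of that idea, not NVIDIA's actual SHARP protocol.

```python
# Hedged model of switch-resident (SHARP-style) allreduce: the switch,
# not the endpoints, performs the element-wise sum.
def switch_allreduce(contributions):
    """Sum all per-GPU vectors element-wise, multicast the result to every port."""
    total = [sum(col) for col in zip(*contributions)]
    return [list(total) for _ in contributions]  # one reduced copy back per GPU

# Three GPUs each contribute a partial-gradient vector.
gpus = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
results = switch_allreduce(gpus)
# every GPU receives the same reduced vector [9.0, 12.0]
```

Doing the sum at FP8 inside the switch (14.4 TFLOPS per tray, per the report) is what lets the reduction keep pace with line-rate traffic.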

Performance vs. Blackwell

| Metric | Blackwell | Rubin | Improvement |
|---|---|---|---|
| NVFP4 Inference | 10 PFLOPS | 50 PFLOPS | 5x |
| NVFP4 Training | 10 PFLOPS | 35 PFLOPS | 3.5x |
| HBM Bandwidth | 8 TB/s | 22 TB/s | 2.8x |
| NVLink per GPU | 1.8 TB/s | 3.6 TB/s | 2x |
| Transistors | 208B | 336B | 1.6x |
| HBM Capacity | 192 GB | 288 GB | 1.5x |
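
The improvement column is just the ratio of the two spec columns, which a quick check confirms (22/8 = 2.75 rounds to the stated 2.8x, and 336/208 ≈ 1.62 rounds to 1.6x):

```python
# Consistency check: each "Improvement" entry is Rubin / Blackwell.
blackwell = {"nvfp4_inference_pflops": 10, "nvfp4_training_pflops": 10,
             "hbm_bw_tbs": 8, "nvlink_tbs": 1.8,
             "transistors_b": 208, "hbm_gb": 192}
rubin = {"nvfp4_inference_pflops": 50, "nvfp4_training_pflops": 35,
         "hbm_bw_tbs": 22, "nvlink_tbs": 3.6,
         "transistors_b": 336, "hbm_gb": 288}

ratios = {k: round(rubin[k] / blackwell[k], 2) for k in blackwell}
```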

NVL72 Rack Specs

  • 72 Rubin GPUs with all-to-all NVLink topology
  • 260 TB/s aggregate scale-up bandwidth
  • Per tray: 200 PFLOPS, 14.4 TB/s NVLink, 2TB fast memory
  • Rack power: 180-220 kW (fully liquid-cooled)
  • Cableless modular trays using Paladin HD2 connectors (assembly: 5 min vs 2 hours)
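
The per-tray and rack aggregates above follow directly from the per-GPU figures if one assumes 4 Rubin GPUs per compute tray (18 trays x 4 GPUs = 72; the tray count and GPUs-per-tray split are an assumption, not stated in the report):

```python
# Derive tray and rack aggregates from per-GPU specs.
GPUS = 72
GPUS_PER_TRAY = 4          # assumed: 18 trays x 4 GPUs = 72
PFLOPS_PER_GPU = 50        # NVFP4 inference
NVLINK_TBS_PER_GPU = 3.6   # bidirectional

tray_pflops = GPUS_PER_TRAY * PFLOPS_PER_GPU       # 200 PFLOPS per tray
tray_nvlink = GPUS_PER_TRAY * NVLINK_TBS_PER_GPU   # 14.4 TB/s per tray
rack_scaleup = GPUS * NVLINK_TBS_PER_GPU           # 259.2 TB/s, ~260 TB/s
```

The 72 x 3.6 TB/s product lands at 259.2 TB/s, consistent with the rounded 260 TB/s aggregate figure quoted in the report.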

Target Workloads

  • Long-context inference (100K+ tokens)
  • Mixture-of-Experts models with dynamic routing
  • Agentic reasoning pipelines
  • Continuous training/post-training
  • Multi-tenant, multi-model execution
  • 10x lower cost per token vs Blackwell for MoE inference
  • Train MoE models with 4x fewer GPUs

Deployment Timeline

  • CES 2026: Architecture announced, full production confirmed
  • H2 2026: Systems shipping to customers
  • March 2026 update: Vera Rubin POD announced with seventh chip

Limitations

  • Extreme power density (180-220kW per rack) requires purpose-built liquid cooling infrastructure
  • Co-packaged silicon photonics for Spectrum-6 is cutting-edge and may face yield challenges at scale
  • Premium pricing — the "extreme co-design" strategy deepens vendor lock-in vs. open standards

Source: Inside the NVIDIA Vera Rubin Platform by Kyle Aubrey, NVIDIA

Tags

gpu-architecture, ai-compute, nvidia, hbm4, nvlink, data-center