HBM4 Memory Architecture
HBM4 is the memory breakthrough that unlocks the next generation of AI inference and training. Debuting in NVIDIA's Rubin GPU, HBM4 delivers 22 TB/s bandwidth per GPU — a 2.8x improvement over Blackwell's 8 TB/s HBM3e — with 288 GB capacity (up from 192 GB).
Memory bandwidth has become the primary bottleneck for large-model inference. Attention compute scales quadratically with context length while the KV cache that must be streamed on each decode step grows linearly, and mixture-of-experts models require fast access to large parameter sets under dynamic routing. HBM4's bandwidth gains translate directly into inference throughput: NVIDIA claims a 5x inference improvement over Blackwell for the Rubin GPU, with memory bandwidth as the key enabler alongside architectural improvements in the fifth-generation Tensor Cores.
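As a back-of-envelope illustration, the sketch below estimates the bandwidth-bound decode ceiling under the idealized assumption that generating one token at batch size 1 streams every weight from HBM exactly once; the 70B-parameter FP8 model is hypothetical, and real kernels fall short of peak bandwidth.

```python
# Idealized bandwidth-bound decode throughput: if each generated token must
# read all model weights from HBM once, tokens/s <= bandwidth / model_bytes.
# The 70B FP8 model is an illustrative assumption, not a Rubin spec.

def decode_tokens_per_sec(bandwidth_tb_s: float, params_b: float,
                          bytes_per_param: float) -> float:
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

for name, bw in [("HBM3e (Blackwell)", 8.0), ("HBM4 (Rubin)", 22.0)]:
    rate = decode_tokens_per_sec(bw, params_b=70, bytes_per_param=1.0)
    print(f"{name}: ~{rate:,.0f} tokens/s ceiling")
```

The ratio of the two ceilings reproduces the 2.75x raw bandwidth gain, which is why bandwidth rather than compute sets batch-1 decode speed for large models.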
The capacity increase to 288 GB per GPU matters for long-context workloads and large MoE models. Within an NVL72 rack, fast memory totals approximately 2 TB per tray, with additional LPDDR5X on the Vera CPU side (1.5 TB per CPU at 1.2 TB/s bandwidth). This aggregate memory pool enables serving models that previously required multiple racks.
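A minimal fit check, assuming a hypothetical 70B-parameter FP8 model and the standard KV-cache size formula (K and V tensors, per layer, per KV head, per head dimension, per token):

```python
# Does weights + KV cache fit in one GPU's 288 GB of HBM4?
# Model shape, sequence length, and batch size below are hypothetical.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 1) -> float:
    # Factor of 2 accounts for both the K and V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

weights_gb = 70 * 1.0  # 70B params at FP8 (1 byte/param) ~= 70 GB
cache_gb = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=128_000, batch=8)
print(f"weights {weights_gb:.0f} GB + KV cache {cache_gb:.0f} GB "
      f"= {weights_gb + cache_gb:.0f} GB vs 288 GB of HBM4")
```

Under these assumptions the workload fits on a single Rubin GPU (~238 GB), while Blackwell's 192 GB would force a smaller batch, shorter contexts, or sharding across GPUs.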
Key Claims
- 22 TB/s bandwidth per GPU — 2.8x over Blackwell's HBM3e at 8 TB/s. Evidence: strong (NVIDIA Vera Rubin)
- 288 GB capacity per GPU — 1.5x over Blackwell's 192 GB, enabling larger models in fewer GPUs. Evidence: strong (NVIDIA Vera Rubin)
- Memory bandwidth is the primary inference bottleneck — HBM4 gains directly enable 5x inference improvement. Evidence: strong (NVIDIA Vera Rubin)
- Vera CPU memory: 1.5 TB LPDDR5X at 1.2 TB/s — Coherent CPU-GPU link at 1.8 TB/s for unified memory access. Evidence: strong (NVIDIA Vera Rubin)
Benchmarks & Data
| Metric | HBM3e (Blackwell) | HBM4 (Rubin) | Improvement |
|---|---|---|---|
| Bandwidth | 8 TB/s | 22 TB/s | 2.8x |
| Capacity | 192 GB | 288 GB | 1.5x |
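The improvement column follows directly from the raw figures:

```python
# Sanity check of the improvement ratios in the table above.
specs = {"Bandwidth (TB/s)": (8, 22), "Capacity (GB)": (192, 288)}
for metric, (hbm3e, hbm4) in specs.items():
    print(f"{metric}: {hbm4 / hbm3e:.2f}x")  # 2.75x (reported as 2.8x), 1.50x
```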
- Vera CPU adds 1.5 TB LPDDR5X at 1.2 TB/s per CPU (NVIDIA)
- Coherent CPU-GPU link at 1.8 TB/s (NVIDIA)
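To put the coherent link in perspective, a rough sketch comparing the time to stream a parameter block from Vera's LPDDR5X over the 1.8 TB/s link against reading it from local HBM4; the 10 GB block size (e.g. a set of offloaded MoE expert weights) is an illustrative assumption:

```python
# Transfer-time comparison: coherent CPU-GPU link (1.8 TB/s) vs local
# HBM4 (22 TB/s). The 10 GB block size is hypothetical.

def transfer_ms(size_gb: float, bandwidth_tb_s: float) -> float:
    return size_gb / (bandwidth_tb_s * 1e3) * 1e3  # GB / (GB/s) -> s -> ms

block_gb = 10
print(f"over coherent CPU-GPU link: {transfer_ms(block_gb, 1.8):5.2f} ms")
print(f"from local HBM4:            {transfer_ms(block_gb, 22.0):5.2f} ms")
```

The roughly 12x gap suggests CPU-side memory serves as a capacity tier for cold parameters rather than a substitute for HBM4 bandwidth on the hot path.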
Open Questions
- What are HBM4 yield rates and cost premiums versus HBM3e?
- How does HBM4 power consumption compare at the per-GPU and per-rack level?
- Will HBM4 be available to custom ASIC vendors (TPU, Trainium) at competitive timelines?
- Does the 288 GB capacity ceiling force model architecture choices, or is it sufficient for 2026-2027 model sizes?
Related Concepts
- Rack-Scale AI Compute — System architecture that leverages HBM4 bandwidth
- Custom Silicon vs GPU — Memory access is a key differentiator in the ASIC vs GPU debate