HBM4 Memory Architecture
HBM4 is the memory breakthrough that unlocks the next generation of AI inference and training. Debuting in NVIDIA's Rubin GPU, HBM4 delivers 22 TB/s bandwidth per GPU — a 2.8x improvement over Blackwell's 8 TB/s HBM3e — with 288 GB capacity (up from 192 GB).
Memory bandwidth has become the primary bottleneck for large-model inference. Attention compute scales quadratically with context length while the KV cache that must be streamed on each decode step grows linearly, and mixture-of-experts models require fast access to large parameter sets under dynamic routing. HBM4's bandwidth gains translate directly into inference throughput: NVIDIA claims a 5x inference improvement over Blackwell for the Rubin GPU, with memory bandwidth as the key enabler alongside architectural improvements in the fifth-generation Tensor Cores.
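As a back-of-envelope illustration, the sketch below estimates the bandwidth-bound decode ceiling under the idealized assumption that generating one token at batch size 1 streams every weight from HBM exactly once; the 70B-parameter FP8 model is hypothetical, and real kernels fall short of peak bandwidth.

```python
# Idealized bandwidth-bound decode throughput: if each generated token must
# read all model weights from HBM once, tokens/s <= bandwidth / model_bytes.
# The 70B FP8 model is an illustrative assumption, not a Rubin spec.

def decode_tokens_per_sec(bandwidth_tb_s: float, params_b: float,
                          bytes_per_param: float) -> float:
    model_bytes = params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / model_bytes

for name, bw in [("HBM3e (Blackwell)", 8.0), ("HBM4 (Rubin)", 22.0)]:
    rate = decode_tokens_per_sec(bw, params_b=70, bytes_per_param=1.0)
    print(f"{name}: ~{rate:,.0f} tokens/s ceiling")
```

The ratio of the two ceilings reproduces the 2.75x raw bandwidth gain, which is why bandwidth rather than compute sets batch-1 decode speed for large models.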
The capacity increase to 288 GB per GPU matters for long-context workloads and large MoE models. Within an NVL72 rack, fast memory totals approximately 2 TB per tray, with additional LPDDR5X on the Vera CPU side (1.5 TB per CPU at 1.2 TB/s bandwidth). This aggregate memory pool enables serving models that previously required multiple racks.
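A minimal fit check, assuming a hypothetical 70B-parameter FP8 model and the standard KV-cache size formula (K and V tensors, per layer, per KV head, per head dimension, per token):

```python
# Does weights + KV cache fit in one GPU's 288 GB of HBM4?
# Model shape, sequence length, and batch size below are hypothetical.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                seq_len: int, batch: int, bytes_per_elem: int = 1) -> float:
    # Factor of 2 accounts for both the K and V tensors per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem / 1e9

weights_gb = 70 * 1.0  # 70B params at FP8 (1 byte/param) ~= 70 GB
cache_gb = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128,
                       seq_len=128_000, batch=8)
print(f"weights {weights_gb:.0f} GB + KV cache {cache_gb:.0f} GB "
      f"= {weights_gb + cache_gb:.0f} GB vs 288 GB of HBM4")
```

Under these assumptions the workload fits on a single Rubin GPU (~238 GB), while Blackwell's 192 GB would force a smaller batch, shorter contexts, or sharding across GPUs.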
Key Claims
- 22 TB/s bandwidth per GPU — 2.8x over Blackwell's HBM3e at 8 TB/s. Evidence: strong (NVIDIA Vera Rubin)
- 288 GB capacity per GPU — 1.5x over Blackwell's 192 GB, enabling larger models in fewer GPUs. Evidence: strong (NVIDIA Vera Rubin)
- Memory bandwidth is the primary inference bottleneck — HBM4 gains directly enable 5x inference improvement. Evidence: strong (NVIDIA Vera Rubin)
- Vera CPU memory: 1.5 TB LPDDR5X at 1.2 TB/s — Coherent CPU-GPU link at 1.8 TB/s for unified memory access. Evidence: strong (NVIDIA Vera Rubin)
Benchmarks & Data
| Metric | HBM3e (Blackwell) | HBM4 (Rubin) | Improvement |
|---|---|---|---|
| Bandwidth | 8 TB/s | 22 TB/s | 2.8x |
| Capacity | 192 GB | 288 GB | 1.5x |
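The improvement column follows directly from the raw figures:

```python
# Sanity check of the improvement ratios in the table above.
specs = {"Bandwidth (TB/s)": (8, 22), "Capacity (GB)": (192, 288)}
for metric, (hbm3e, hbm4) in specs.items():
    print(f"{metric}: {hbm4 / hbm3e:.2f}x")  # 2.75x (reported as 2.8x), 1.50x
```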
- Vera CPU adds 1.5 TB LPDDR5X at 1.2 TB/s per CPU (NVIDIA)
- Coherent CPU-GPU link at 1.8 TB/s (NVIDIA)
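To put the coherent link in perspective, a rough sketch comparing the time to stream a parameter block from Vera's LPDDR5X over the 1.8 TB/s link against reading it from local HBM4; the 10 GB block size (e.g. a set of offloaded MoE expert weights) is an illustrative assumption:

```python
# Transfer-time comparison: coherent CPU-GPU link (1.8 TB/s) vs local
# HBM4 (22 TB/s). The 10 GB block size is hypothetical.

def transfer_ms(size_gb: float, bandwidth_tb_s: float) -> float:
    return size_gb / (bandwidth_tb_s * 1e3) * 1e3  # GB / (GB/s) -> s -> ms

block_gb = 10
print(f"over coherent CPU-GPU link: {transfer_ms(block_gb, 1.8):5.2f} ms")
print(f"from local HBM4:            {transfer_ms(block_gb, 22.0):5.2f} ms")
```

The roughly 12x gap suggests CPU-side memory serves as a capacity tier for cold parameters rather than a substitute for HBM4 bandwidth on the hot path.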
Open Questions
- What are HBM4 yield rates and cost premiums versus HBM3e?
- How does HBM4 power consumption compare at the per-GPU and per-rack level?
- Will HBM4 be available to custom ASIC vendors (TPU, Trainium) at competitive timelines?
- Does the 288 GB capacity ceiling force model architecture choices, or is it sufficient for 2026-2027 model sizes?
Related Concepts
- Rack-Scale AI Compute — System architecture that leverages HBM4 bandwidth
- Custom Silicon vs GPU — Memory access is a key differentiator in the ASIC vs GPU debate