HBM4 Memory Architecture
HBM4 is the memory breakthrough that unlocks the next generation of AI inference and training — but the bigger story is the architectural shakeup that accompanies it. Three generations matter now: HBM4 (2026), HBM4E (2026–2027), and C-HBM4E (2026–2027). Two structural shifts make this the largest HBM change since the standard was introduced:
- Base dies move from DRAM to logic processes. HBM4 base dies are manufactured on TSMC 12FFC, N5, or N3P — not traditional DRAM nodes. This makes HBM base dies ~2× more power-efficient than HBM3E's and opens the door to embedding custom logic (C-HBM4E).
- Interface doubles to 2,048 bits with 32 channels per stack, directly targeting the AI memory wall.
Debuting in NVIDIA's Rubin GPU, HBM4 delivers 22 TB/s per GPU (about 2.8× Blackwell's 8 TB/s of HBM3E) with 288 GB of capacity. HBM4E pushes per-stack bandwidth to 3 TB/s, 2.5× that of HBM3E, with speeds of up to 12.8 GT/s demonstrated. The Rubin Ultra GPU is specified for 1 TB of HBM4E across 8 stacks, for a potential 16 TB/s per accelerator (roughly 2 TB/s per stack, below the 3 TB/s HBM4E per-stack peak).
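The per-stack figures above follow from interface width times per-pin data rate. A minimal sketch of that arithmetic, assuming the per-pin speeds listed in the Benchmarks table (9.4 Gb/s for HBM3E, 12 Gb/s for HBM4E) and decimal units; the function name is illustrative only:

```python
def stack_bandwidth_tbps(interface_bits: int, pin_rate_gbps: float) -> float:
    """Peak bandwidth of one HBM stack in TB/s (decimal units)."""
    # bits/s across the full interface -> bytes/s (divide by 8) -> TB/s (divide by 1000)
    return interface_bits * pin_rate_gbps / 8 / 1000


hbm3e = stack_bandwidth_tbps(1024, 9.4)         # 1,024-bit interface
hbm4e = stack_bandwidth_tbps(2048, 12.0)        # 2,048-bit interface
hbm4e_demo = stack_bandwidth_tbps(2048, 12.8)   # demonstrated PHY speed

print(f"HBM3E per stack:       {hbm3e:.2f} TB/s")
print(f"HBM4E per stack:       {hbm4e:.2f} TB/s")
print(f"HBM4E at 12.8 GT/s:    {hbm4e_demo:.2f} TB/s")
print(f"HBM4E / HBM3E ratio:   {hbm4e / hbm3e:.2f}x")

# Rubin Ultra as quoted: 16 TB/s over 8 stacks
rubin_ultra_per_stack = 16 / 8
print(f"Rubin Ultra per stack: {rubin_ultra_per_stack:.1f} TB/s")
```

Under these assumptions the doubled interface and the faster pins together account for the quoted 2.5× per-stack gain, while the 16 TB/s Rubin Ultra figure works out to 2 TB/s per stack.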
C-HBM4E is the real structural change. It retains standard HBM4E devices but allows custom base dies at three escalating levels: (1) logic integration on the base die, (2) a custom die-to-die (D2D) interface with the memory controller on the base die, and (3) near-memory compute (NMC), meaning basic processing capabilities inside the memory device itself. NMC requires topology-aware software plus runtime and OS changes to handle heterogeneous memory domains, but it reframes memory from a passive bottleneck into an active compute substrate. One source calls this "potentially the biggest shift in how computers work in decades."
Key Claims
- 22 TB/s bandwidth per GPU (HBM4, Rubin) — 2.8× over Blackwell. Evidence: strong (NVIDIA Rubin)
- 288 GB capacity per GPU (HBM4, Rubin) — 1.5× over Blackwell's 192 GB. Evidence: strong (NVIDIA Rubin)
- HBM4E: 3 TB/s per stack — 2.5× bandwidth over HBM3E. Evidence: strong (HBM4 Shakeup)
- 2,048-bit interface — doubled vs HBM3E's 1,024. Evidence: strong (HBM4 Shakeup)
- 32 channels per stack — doubled concurrency. Evidence: strong (HBM4 Shakeup)
- 12.8 GT/s demonstrated (Cadence PHY) — above spec. Evidence: strong (HBM4 Shakeup)
- Base dies on TSMC logic processes (12FFC / N5 / N3P): ~2× more power-efficient than HBM3E's DRAM-process base dies. Evidence: strong (HBM4 Shakeup)
- C-HBM4E with near-memory compute — basic processing inside memory devices; requires topology-aware software. Evidence: moderate (HBM4 Shakeup)
- Rubin Ultra: 1 TB HBM4E, 16 TB/s potential, across 8 HBM4E stacks per accelerator. Evidence: strong (HBM4 Shakeup)
- Memory subsystems with 48 TB/s bandwidth possible — with custom D2D interfaces. Evidence: moderate (HBM4 Shakeup)
Benchmarks & Data
| Metric | HBM3E (Blackwell) | HBM4 (Rubin) | HBM4E | Improvement HBM3E→HBM4E |
|---|---|---|---|---|
| Bandwidth per GPU | 8 TB/s | 22 TB/s | up to 48 TB/s (C-HBM4E) | 2.5× (stack) / 6× (system) |
| Per-pin data rate | 9.4 Gb/s | 8–12.8 Gb/s | 12 Gb/s (12.8 demonstrated) | ~1.3× |
| I/O width | 1,024-bit | 2,048-bit | 2,048-bit | 2× |
| Channels | 16 | 32 | 32 | 2× |
| Capacity per GPU | 192 GB | 288 GB | 1 TB (Rubin Ultra) | ~5× |
| Base die process | DRAM | TSMC 12FFC/N5 | TSMC N5/N3P | DRAM → logic node |
| Operating voltage | 1.1V | 0.75–0.8V | 0.679–0.963V | ~35% lower |
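The improvement figures in the table can be recomputed from its own entries. A short sketch, assuming 1 TB is taken as 1000 GB and using 1.1 V as the HBM3E voltage baseline; the dictionary keys are descriptive labels, not official metric names:

```python
def ratio(new: float, old: float) -> float:
    """Improvement factor of `new` over `old`."""
    return new / old


improvements = {
    "bandwidth per GPU, HBM4 vs HBM3E": ratio(22, 8),
    "bandwidth per GPU, C-HBM4E system vs HBM3E": ratio(48, 8),
    "per-pin rate, HBM4E vs HBM3E": ratio(12, 9.4),
    "I/O width": ratio(2048, 1024),
    "channels": ratio(32, 16),
    "capacity, HBM4 vs HBM3E": ratio(288, 192),
    "capacity, Rubin Ultra vs HBM3E": ratio(1000, 192),  # 1 TB taken as 1000 GB
}

for name, value in improvements.items():
    print(f"{name}: {value:.2f}x")


def voltage_reduction(new_v: float, old_v: float = 1.1) -> float:
    """Fractional voltage reduction relative to the HBM3E baseline."""
    return 1 - new_v / old_v


# HBM4E range from the table: 0.679 V to 0.963 V
smallest_cut = voltage_reduction(0.963)
largest_cut = voltage_reduction(0.679)
print(f"HBM4E voltage reduction: {smallest_cut:.0%} to {largest_cut:.0%}")
```

On these numbers the voltage reduction spans a range, so the "~35% lower" entry is best read as describing the low-voltage end of the HBM4E operating window.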
C-HBM4E Customization Levels
| Level | What's Custom | Example Benefit |
|---|---|---|
| 1 | Logic + caches on base die | Enhanced performance, standard interface |
| 2 | Custom D2D interface | More stacks per SoC, no package expansion |
| 3 | Near-memory compute (NMC) | Processing inside memory — "biggest shift in decades" |
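One way to picture the three levels is as an ordered enumeration that system software could inspect. The sketch below is purely illustrative: the class and property names are hypothetical, and it assumes each level builds on the one below it (the source calls the levels "escalating" but does not spell out whether level 3 always includes a custom D2D interface).

```python
from dataclasses import dataclass
from enum import IntEnum


class CustomizationLevel(IntEnum):
    """C-HBM4E customization levels as listed in the table (names are hypothetical)."""
    BASE_DIE_LOGIC = 1       # logic + caches on the base die, standard interface
    CUSTOM_D2D = 2           # custom die-to-die interface
    NEAR_MEMORY_COMPUTE = 3  # basic processing inside the memory device


@dataclass(frozen=True)
class StackConfig:
    level: CustomizationLevel

    @property
    def uses_standard_interface(self) -> bool:
        # Assumption: only level 1 keeps the standard HBM4E interface
        return self.level == CustomizationLevel.BASE_DIE_LOGIC

    @property
    def needs_topology_aware_software(self) -> bool:
        # Per the text, NMC is the level that requires software, runtime, and OS changes
        return self.level >= CustomizationLevel.NEAR_MEMORY_COMPUTE


for lvl in CustomizationLevel:
    cfg = StackConfig(lvl)
    print(lvl.name, cfg.uses_standard_interface, cfg.needs_topology_aware_software)
```

The point of the ordering is that software impact grows with the level: levels 1 and 2 are largely a hardware and packaging concern, while level 3 changes what the programming model has to know about memory.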
Roadmap
| Variant | Availability | Status |
|---|---|---|
| HBM4 | 2026 | GUC PHY tape-out N3P (Mar 2025); silicon validation Q1 2026 |
| HBM4E | 2026–2027 | In development |
| C-HBM4E | 2026–2027 | In development |
Manufacturers
- DRAM: Micron (high-volume HBM4 for NVIDIA Vera Rubin), SK Hynix, Samsung
- Base dies: TSMC (12FFC / N5 / N3P)
- IP: GUC (PHY), Rambus (controller + C-HBM4E guidance), Cadence (12.8 GT/s PHY), Siemens EDA, Synopsys
Open Questions
- What are HBM4 yield rates and cost premiums versus HBM3E?
- Does C-HBM4E near-memory compute find a real workload beyond simulation benchmarks?
- Can ASIC vendors (TPU, Trainium) get HBM4/HBM4E at competitive timelines vs NVIDIA?
- What programming models emerge for NMC — CUDA extensions, or a new stack entirely?
- Does 1 TB HBM4E per accelerator change model-parallelism tradeoffs enough to shrink racks?
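The model-parallelism question can be framed with a rough capacity estimate. The sketch below counts the minimum number of accelerators needed just to hold model weights; the 1-trillion-parameter model, 1 byte per parameter, and the 80% usable-capacity fraction are all hypothetical inputs chosen for illustration, and the estimate ignores KV cache, activations, and bandwidth constraints.

```python
import math


def min_accelerators(params_billion: float, bytes_per_param: float,
                     hbm_gb: float, usable_fraction: float = 0.8) -> int:
    """Minimum accelerators needed to fit the weights alone (hypothetical model)."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB = GB of weights
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb / (hbm_gb * usable_fraction))


# Hypothetical 1T-parameter model stored at 1 byte per parameter
capacities_gb = {"Blackwell (192 GB)": 192, "Rubin (288 GB)": 288, "Rubin Ultra (1 TB)": 1000}
needed = {name: min_accelerators(1000, 1, gb) for name, gb in capacities_gb.items()}

for name, count in needed.items():
    print(f"{name}: at least {count} accelerators")
```

Under these assumptions the weight-sharding floor drops from 7 accelerators on Blackwell to 5 on Rubin and 2 on Rubin Ultra, which is the kind of shift the open question is asking about; whether it translates into smaller racks depends on factors this estimate leaves out.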
Related Concepts
- Rack-Scale AI Compute — HBM4 bandwidth enables rack-scale serving
- Custom Silicon vs GPU — memory access is a key differentiator in the ASIC vs GPU debate
- Nanosheet GAA Transistor — logic process nodes (N3P) are the new home for HBM base dies
- Processing-In-Memory — C-HBM4E level-3 NMC is the commercial wedge for memory-centric computing
Changelog
- 2026-04-09 — Initial compilation from NVIDIA Rubin + SemiAnalysis.
- 2026-04-17 — Major update: added HBM4E (2.5× bandwidth, 2,048-bit interface, 32 channels), C-HBM4E (3 customization levels including near-memory compute), and TSMC logic-process base dies. Cross-linked to Nanosheet GAA Transistor.
Theses that depend on this concept
These research positions cite this concept in their evidence. If the concept changes materially, these theses may need re-scoring.