Processing-In-Memory (PIM) & Memory-Centric Computing
Status: Active Frontier
Processing-In-Memory reframes the oldest assumption in computing: that memory is passive and the CPU does the work. In modern AI and consumer workloads, 60–90% of total system energy is data movement, not computation — a DRAM access costs 800× a floating-point op, and up to 64,000× when sensors and storage are included. Meanwhile, state-of-the-art data-center processors spend 80–90% of their time waiting for memory. The implication: the AI compute shortage is substantially a memory-movement shortage, and the correct fix is not more cores but computation at the data.
Two commercial trajectories are converging on this idea. Near-memory approaches (HBM4E/C-HBM4E with logic on the base die, UPMEM-style bank-level processors, Samsung HBM-PIM, SK Hynix AiM) integrate conventional logic adjacent to memory. Using-memory approaches (RowClone, Ambit) exploit the analog behavior of DRAM cells to compute within the array — bulk copy via consecutive row activations, bitwise AND/OR/NOT/majority via concurrent multi-row activation. Critically, researchers (SAFARI / ETH Zurich, Mutlu group) have shown several of these ops run reliably on unmodified, commodity DRAM by violating nominal timing parameters.
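The PUM mechanism is easy to model: simultaneously activating three DRAM rows charge-shares them, and each bitline settles to the majority value, so AND and OR fall out by fixing one operand row to all-zeros or all-ones (the core Ambit idea). A minimal functional sketch, using Python integers as stand-ins for rows — an idealized model of the logic, not vendor code:

```python
# Idealized model of Ambit-style in-DRAM bitwise logic. Triple-row
# activation charge-shares three rows; each bitline settles to the
# bitwise majority of the three stored values.
def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three same-width integers."""
    return (a & b) | (b & c) | (a & c)

def in_dram_and(a: int, b: int) -> int:
    return maj3(a, b, 0)    # third row pre-initialized to all zeros

def in_dram_or(a: int, b: int) -> int:
    return maj3(a, b, ~0)   # third row pre-initialized to all ones

a, b = 0b1100, 0b1010
assert in_dram_and(a, b) == a & b   # 0b1000
assert in_dram_or(a, b) == a | b    # 0b1110
```

NOT comes from the sense amplifier's complementary bitline, which together with majority makes the substrate functionally complete.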
Reliability scaling is forcing intelligence into memory regardless of whether the industry wants PIM. RowHammer, RowPress, and column-disturbance mechanisms make modern DRAM a physical-security liability without on-die logic. DDR5 already embeds activation counters; this is the first rung on a ladder that ends at Self-Managing DRAM — memory that schedules its own refresh and defenses and can signal "not now" to the CPU.
The bottleneck is no longer physics — it's paradigm. JEDEC (~390 companies) rarely converges on radical interface changes; the "Self-Managing DRAM" paper was rejected six times over three and a half years. Onur Mutlu frames the full shift as a "Copernican Revolution" likely to take decades. Near-term, the investable surface is narrower: HBM4E/C-HBM4E base-die compute, commercial PIM DRAM (HBM-PIM, AiM, UPMEM), CXL 3.x composable memory, and hyperscaler topology-aware software stacks that can exploit any of it.
Key Claims
- 60–90% of system energy is data movement across consumer apps (Chrome, video codecs, TF inference) and ML workloads (LSTMs, transducers). Evidence: moderate — single-researcher framing, needs independent corroboration (Mutlu synthesis)
- 80–90% of data-center processor time is spent waiting on memory. Evidence: moderate, attributed to Google (Mutlu synthesis)
- DRAM access ≈ 800× FP op, 6,400× int add, 64,000× including storage/sensors. Evidence: moderate (Mutlu synthesis)
- RowClone and Ambit operate on unmodified DRAM by violating nominal timing parameters. Evidence: moderate, demonstrated in lab (Mutlu synthesis)
- HBM4E base dies on TSMC N3P enable C-HBM4E near-memory compute — one of three C-HBM4E customization levels. Evidence: strong (HBM4 Shakeup)
- DDR5 embeds activation counters for RowHammer defense. Evidence: strong — ratified spec (Mutlu synthesis)
- UPMEM acquired by Qualcomm (June 2025) — RISC-V DPU cores embedded in standard DDR4/DDR5 DIMMs; documented speedups of up to 259× for large-batch MLP inference. Evidence: strong — acquisition confirmed by PitchBook and CB Insights; technical architecture documented in multiple arXiv surveys (Mutlu synthesis)
- Samsung HBM-PIM (Aquabolt-XL) — 2.5× system performance, 60% energy reduction measured on Xilinx Virtex Ultrascale+ (Alveo) AI accelerator. Evidence: strong — Samsung + Hot Chips 33 disclosed (Mutlu synthesis)
- SK Hynix AiMX card (32 GB, GDDR6-AiM) ran Llama 2 70B at Hot Chips 2024 / AI HW Summit 2024; 1.25 V operating voltage vs 1.35 V standard → ~80% data-movement power reduction. Evidence: strong — SK Hynix technical disclosure (Mutlu synthesis)
- RowPress documented on commodity DDR4 — bit flips with orders-of-magnitude fewer activations than RowHammer. Evidence: strong — arXiv:2406.16153 (Luo et al., SAFARI) (Mutlu synthesis)
- Self-Managing DRAM framework for in-DRAM autonomous operations — appeared at MICRO 2024 (Yaglikci, Luo, Mutlu). Evidence: strong — peer-reviewed venue (Mutlu synthesis)
- RowHammer paper won 2024 Jean-Claude Laprie Award for dependable computing — signals field-wide recognition of the reliability angle. Evidence: strong (Mutlu synthesis)
Two Approaches Compared
| Dimension | Processing Near Memory (PNM) | Processing Using Memory (PUM) |
|---|---|---|
| Where compute lives | Logic layer in 3D-stack / on HBM base die / per DRAM bank | Inside the DRAM cell array itself |
| Logic used | Conventional ALUs, processors | Row activations, charge sharing |
| Examples | C-HBM4E NMC, HBM-PIM, AiM, UPMEM | RowClone, Ambit |
| Workload fit | General-purpose + ML kernels | Bulk memcpy, bitwise ops, RNG |
| Productization status | Commercial (Samsung, SK Hynix, UPMEM) + roadmapped (C-HBM4E 2026–27) | Research + unmodified-DRAM demonstrations |
| Programming model | Firmware + runtime extensions; CUDA-adjacent | New primitives; ISA-level changes required |
| Key barrier | Topology-aware software, yield, heat | JEDEC interface, determinism guarantees |
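The PNM programming model in the table reduces to a scatter-compute-reduce pattern: partition data across per-bank processors, run the same kernel near each bank, and combine partials on the host. A minimal sketch of that pattern (hypothetical API, loosely modeled on UPMEM's host/DPU split, with threads standing in for DPUs):

```python
# Sketch of the near-memory offload pattern: scatter data to per-bank
# processors, run the kernel near each bank, reduce partials on the host.
from concurrent.futures import ThreadPoolExecutor

def dpu_kernel(chunk: list[int]) -> int:
    # Runs "inside" one memory bank, e.g. a dot-product partial.
    return sum(x * x for x in chunk)

def pim_offload(data: list[int], n_banks: int = 8) -> int:
    chunks = [data[i::n_banks] for i in range(n_banks)]   # scatter to banks
    with ThreadPoolExecutor(n_banks) as pool:             # stand-in for DPUs
        partials = pool.map(dpu_kernel, chunks)
    return sum(partials)                                  # host-side reduce

print(pim_offload(list(range(100))))  # → 328350
```

The design point the table calls "topology-aware software" is exactly the scatter step: the host must know which bank holds which chunk, since DPUs cannot read each other's banks.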
Energy Arithmetic
| Operation | Relative Energy (64-bit FP op = 1×) |
|---|---|
| 64-bit FP multiply-add | 1× |
| 32-bit integer add | ~0.1× |
| DRAM read/write (access) | 800× (≈ 6,400× a 32-bit int add) |
| DRAM + storage + sensor chain | ~64,000× |
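The table's ratios can be sanity-checked with a back-of-envelope model: for a kernel performing N FP ops per DRAM access, what share of energy is data movement? A minimal sketch using only the two ratios above; the operating points are illustrative, not measured workloads:

```python
# Back-of-envelope data-movement energy share, in relative units
# (64-bit FP op = 1x, DRAM access = 800x, per the table above).
FP_OP, DRAM_ACCESS = 1.0, 800.0

def data_movement_share(fp_ops_per_access: float) -> float:
    compute = fp_ops_per_access * FP_OP
    movement = DRAM_ACCESS
    return movement / (compute + movement)

# Even at 100 FP ops per DRAM access, movement still dominates:
print(round(data_movement_share(100), 2))   # → 0.89
print(round(data_movement_share(1000), 2))  # → 0.44
```

At realistic arithmetic intensities for memory-bound kernels, this lands squarely in the 60–90% range claimed above, which is why the claim is plausible even before independent replication.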
Reliability Pressure Forcing Memory Intelligence
| Mechanism | What it is | Status |
|---|---|---|
| RowHammer | Repeated row activations leak charge into adjacent rows → bit flips | Exploited in the wild; DDR5 activation counters partially mitigate |
| RowPress | Long-held row activation induces flips with orders-of-magnitude fewer activations | Recently discovered; no standardized defense yet |
| Column disturbance | Disturbance through column operations that can affect thousands of rows simultaneously | Newly identified |
| DDR5 activation counters | On-die logic triggers adjacent-row refresh | Shipping |
| Self-Managing DRAM | Memory schedules own refresh/defense, can defer CPU requests | Research; Mutlu paper accepted after 6 rejections |
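The DDR5 counter mechanism in the table is conceptually simple: count activations per row and refresh the physical neighbors when a threshold is crossed. A simplified sketch of that logic (the threshold and refresh policy here are illustrative, not the JEDEC-specified mechanism):

```python
# Simplified per-row activation counting with adjacent-row (victim)
# refresh, in the spirit of DDR5's on-die RowHammer mitigation.
from collections import defaultdict

THRESHOLD = 4  # illustrative; real disturbance thresholds are far higher

class BankCounters:
    def __init__(self) -> None:
        self.acts: dict[int, int] = defaultdict(int)

    def on_activate(self, row: int) -> list[int]:
        """Count an activation; past the threshold, reset the counter
        and return the neighbor rows that must be refreshed."""
        self.acts[row] += 1
        if self.acts[row] >= THRESHOLD:
            self.acts[row] = 0
            return [row - 1, row + 1]
        return []

bank = BankCounters()
for _ in range(3):
    assert bank.on_activate(7) == []
assert bank.on_activate(7) == [6, 8]  # victim refresh triggered
```

Note why this fails against RowPress: the attack holds a row open for a long time rather than activating it repeatedly, so an activation counter never fires — which is the argument for the more general Self-Managing DRAM approach.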
Commercial Signals to Track
- UPMEM → Qualcomm acquisition (June 2025, confirmed) — first pure-play PIM exit to a major semi. Tech: RISC-V DPU cores per DDR4/DDR5 DIMM. Watch for post-deal product announcements integrating UPMEM into Qualcomm's AI-PC / edge-AI / data-center stack.
- Samsung HBM-PIM (Aquabolt-XL) — HBM2-based, validated at 2.5× perf and 60% energy cut on Xilinx Alveo. No confirmed hyperscaler production deployment yet — watch for HBM3/HBM4-PIM successor products as Samsung scales HBM capacity +50% in 2026.
- SK Hynix AiMX (GDDR6-AiM) — 32 GB card running Llama 2 70B demoed at Hot Chips 2024; 80% data-movement power savings. Watch for LPDDR6-AiM variant targeting on-device AI (CES 2026 roadmap hint).
- C-HBM4E near-memory compute — level-3 customization in the HBM4E roadmap; hyperscaler topology-aware stacks are the gating item. NVIDIA Rubin Ultra (1 TB HBM4E, 8 stacks) is the natural first host.
- CXL 3.x composable memory — memory disaggregation sets the interface conditions for memory-centric design at rack scale.
- JEDEC DDR6 spec drafts — watch for activation counter evolution and any hint of CPU-memory interface flexibility (Self-Managing DRAM primitives would be the breakthrough signal).
- RowPress and column-disturbance exploits in the wild — first published CVE will force a JEDEC response and validate the "reliability forces intelligence" thesis.
Investment Implications (preliminary)
- Reframes NVIDIA's moat from FLOPs to memory-system integration (NVLink, Grace–Hopper coherence, rack-scale memory bandwidth).
- Elevates memory makers — SK Hynix, Samsung Memory, Micron — from capacity suppliers to active-compute participants as C-HBM4E NMC and HBM-PIM/AiM gain traction.
- Fabric & interface layer (Astera Labs, Marvell, Broadcom, Rambus, Synopsys/Cadence IP) is the arbitrage seat as CXL 3.x and custom D2D interfaces proliferate.
- Standards-body dynamics are a genuine alpha source: JEDEC activity is a forward indicator 3–5 years out.
- Pure-play PIM is thin — UPMEM was the notable independent; post-acquisition this is mostly a "hidden inside an incumbent" story.
Open Questions
- Do the 60–90% data-movement energy figures replicate on independent hyperscaler workloads (Meta, Microsoft benchmarks)?
- Does C-HBM4E NMC find a production workload beyond vector-DB / recommendation serving, or does it stay a niche accelerator?
- What's the realistic 3-year PIM TAM — is it a 5% sidecar to HBM, or a 30%+ reshaping of memory ASP?
- Do hyperscaler custom stacks (TPU, Trainium, MTIA, Maia) get first or last access to C-HBM4E topology-aware software?
- Does a CXL 3.x-native memory-centric reference architecture emerge, or does it stay a loose constellation of vendor-specific features?
- What's the attacker's-view timeline on RowPress exploits in the wild, and does it force a JEDEC spec revision?
Related Concepts
- HBM4 Memory Architecture — C-HBM4E near-memory compute is the first commercial wedge of PIM into mainstream AI accelerators
- Custom Silicon vs GPU — memory-system integration is where ASICs either catch NVIDIA or don't
- Rack-Scale AI Compute — rack-as-product design lets hyperscalers ship topology-aware memory software
Changelog
- 2026-04-21 — Initial compilation from Mutlu (ETH Zurich / SAFARI) synthesis. Cross-linked to HBM4 concept since C-HBM4E NMC is the commercial bridge.
- 2026-04-21 — Primary-source pass: UPMEM/Qualcomm acquisition confirmed (June 2025); Aquabolt-XL numbers (2.5× / 60%) and AiMX/Llama 2 70B demo added; RowPress (arXiv:2406.16153) and Self-Managing DRAM (MICRO 2024) cited; RowHammer paper's 2024 Jean-Claude Laprie Award noted. Evidence levels upgraded accordingly.