Memory-Centric Computing: A Paradigm Shift for Sustainable and Efficient Systems
Argues processor-centric computing is fundamentally broken: 60–90% of system energy is data movement, not compute; DRAM can compute (RowClone, Ambit); reliability threats (RowHammer, RowPress, column disturbance) force memory intelligence anyway; JEDEC + trillion-dollar incumbency is the bottleneck, not technology
Executive Summary
The current computing paradigm is fundamentally flawed by its "processor-centric" nature, which treats memory as a passive, "dumb" storage device. This architecture necessitates constant data movement between storage, memory, and processor, producing massive energy waste, performance bottlenecks, and reliability issues. Between 60% and 90% of total system energy in modern workloads — from mobile browsing to large-scale ML — is consumed by data movement across the memory hierarchy.
Transitioning to "memory-centric" computing means (1) enabling memory to perform active computation (Processing-In-Memory / PIM) and (2) letting memory autonomously manage its own maintenance and reliability (Self-Managing DRAM). Orders-of-magnitude improvements in efficiency are technically evident, but three barriers remain: a trillion-dollar processor-centric industry, rigid JEDEC interface standards, and conservative academic/industrial review processes.
The Crisis of Processor-Centric Computing
Modern systems allocate ~95% of hardware real estate to storing and moving data, yet these functions remain the primary bottlenecks.
Performance & Energy Bottlenecks
- The waiting processor: Data from Google indicates state-of-the-art data-center processors spend 80–90% of their time waiting for memory loads; instructions execute only 10–20% of the time.
- Energy disparity: moving data dwarfs the cost of computing on it.
  - DRAM read/write ≈ 800× the energy of a 64-bit double-precision FP op.
  - DRAM read/write ≈ 6,400× the energy of a 32-bit integer add.
  - Including the storage and sensor path, data movement costs ~64,000× the energy of the computation itself.
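A quick back-of-the-envelope check of these ratios, sketched in Python. The ratios are the ones quoted above (ultimately from keynote figures; absolute picojoule values shift with process node), and the streaming-sum workload and "no cache" assumption are illustrative, not from the source:

```python
# Normalize everything to the energy of one 64-bit FP op = 1 unit.
FP64_OP = 1.0
DRAM_ACCESS = 800 * FP64_OP       # DRAM read/write ~= 800x an FP64 op
INT32_ADD = DRAM_ACCESS / 6400    # the 6,400x figure is relative to a 32-bit add
FULL_PATH = 64000 * INT32_ADD     # storage + sensor + DRAM path vs. the add itself

def energy_to_sum(n_elements: int) -> dict:
    """Energy to sum n values streamed from DRAM vs. the adds themselves.

    Assumes every element is fetched from DRAM once (no caching) -- an
    illustrative worst case, not a claim about any real memory hierarchy.
    """
    movement = n_elements * DRAM_ACCESS
    compute = (n_elements - 1) * INT32_ADD
    return {
        "movement": movement,
        "compute": compute,
        "movement_share": movement / (movement + compute),
    }

stats = energy_to_sum(1_000_000)
print(f"data movement share: {stats['movement_share']:.4%}")
```

Under these ratios, essentially all of the energy for a streamed reduction is data movement, which is the point the bullet list is making.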
Real-World Workloads (Google joint study)
- Consumer apps: >60% of total system energy in Chrome browsing, video encode/decode, and TensorFlow inference is data movement.
- ML models: >90% of total system energy for LSTMs and transducers is memory + off-chip interconnect.
Reliability & Scaling
Scaling DRAM to smaller nodes increases susceptibility to physical disturbance — forcing intelligence into memory controllers.
- RowHammer — repeated row activations cause charge leakage that flips bits in adjacent rows. Exploited for sandbox breakouts, cryptographic-key corruption, unauthorized access.
- RowPress — keeping a row active for extended periods induces bit flips with orders of magnitude fewer activations than RowHammer.
- Column disturbance — newly identified mechanism affecting thousands of rows simultaneously.
Industry patches: DDR5 now includes activation counters that trigger adjacent-row refreshes. These are patches, not fundamental shifts toward autonomous memory.
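The activation-counter idea can be sketched functionally. This is a toy model, not the JEDEC mechanism: the threshold value, the reset policy, and the "adjacent rows only" victim model are all illustrative assumptions.

```python
from collections import defaultdict

class ActivationCounterMitigation:
    """Toy per-row activation counter with victim-row refresh.

    Hedged sketch: real DDR5 refresh management differs in counter
    placement, thresholds, and victim selection.
    """

    def __init__(self, threshold: int = 4096):
        self.threshold = threshold
        self.counts = defaultdict(int)       # activations per row this window
        self.victim_refreshes = []           # log of (aggressor, victims)

    def activate(self, row: int) -> None:
        self.counts[row] += 1
        if self.counts[row] >= self.threshold:
            victims = [row - 1, row + 1]     # assume only adjacent rows are at risk
            self.victim_refreshes.append((row, victims))
            self.counts[row] = 0             # counter cleared after mitigation

    def on_refresh_window(self) -> None:
        self.counts.clear()                  # all counters reset each refresh window

# Hammering one row repeatedly triggers periodic victim refreshes:
mit = ActivationCounterMitigation(threshold=3)
for _ in range(7):
    mit.activate(42)
```

Note how the mitigation logic lives with the memory, not the CPU: the counter state and the refresh decision are local, which is exactly the "intelligence in memory" direction the section argues these patches gesture toward.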
The Solution: Processing-In-Memory (PIM)
Processing Near Memory (PNM)
Places traditional logic close to or inside the memory chip (e.g., 3D-stacked memory with logic layer).
- UPMEM (acquired by Qualcomm, June 2025) designs DRAM chips with a general-purpose multi-threaded processor per bank.
- Potential: distributed compute across memory controllers, accelerating LLMs to the point of "GPU-free" systems.
Processing Using Memory (PUM)
Exploits the analog properties of memory cells to compute with minimal additional logic.
- RowClone: consecutive DRAM activates copy data row-to-row within a subarray without CPU involvement — major memcpy energy/latency reduction.
- Ambit: concurrent activation of multiple rows enables bitwise majority / AND / OR / NOT — a "bulk bitwise computation engine" inside DRAM.
- Real-world validation: tests on existing, unmodified DRAM chips — operated with deliberately violated timing parameters — show that copy, logical operations, and true random number generation can often be performed reliably, with significant throughput gains.
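Ambit's core trick can be modeled functionally: triple-row activation computes a bitwise majority of three rows, and AND/OR fall out by fixing the third row to all-zeros or all-ones. A minimal sketch (a functional model of the published technique, not silicon behavior; bit vectors are represented as Python ints):

```python
def maj3(a: int, b: int, c: int) -> int:
    """Bitwise majority of three equal-width bit vectors (as ints)."""
    return (a & b) | (b & c) | (a & c)

def ambit_and(a: int, b: int) -> int:
    """AND via triple-row activation with a control row of all zeros."""
    return maj3(a, b, 0)

def ambit_or(a: int, b: int, width: int = 8) -> int:
    """OR via triple-row activation with a control row of all ones."""
    return maj3(a, b, (1 << width) - 1)

a, b = 0b1100_1010, 0b1010_0110
assert ambit_and(a, b) == a & b
assert ambit_or(a, b) == a | b
```

The point of the model: one analog operation (simultaneous activation) yields a whole row's worth of bitwise results, which is where the "bulk" in bulk bitwise computation comes from. (In the actual design, NOT uses a separate dual-contact cell rather than multi-row activation.)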
Architectural & Theoretical Shifts
| Concept | Processor-Centric (now) | Memory-Centric (proposed) |
|---|---|---|
| Interface | Rigid; CPU controls all memory timing (e.g., refresh every 7.8µs) | Self-Managing DRAM: memory can signal "no" to CPU during internal tasks (refresh, RowHammer protection) |
| Complexity theory | Big-O on processor operations | Data-centric complexity models that count data movement |
| System roles | CPU/accelerator = master, memory = slave | Distributed system of equal, coordinating agents |
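The Self-Managing DRAM row of the table can be made concrete with a toy handshake: the memory may answer "not now" while it runs internal maintenance, instead of the CPU dictating all timing. The message names, maintenance schedule, and retry policy below are illustrative assumptions, not the proposed interface:

```python
class SelfManagingDRAM:
    """Toy memory that periodically goes busy for internal maintenance."""

    def __init__(self, maintenance_every: int = 5):
        self.tick = 0
        self.maintenance_every = maintenance_every

    def request(self, row: int) -> str:
        self.tick += 1
        if self.tick % self.maintenance_every == 0:
            return "BUSY"                 # memory says "no": maintenance in flight
        return f"DATA(row={row})"

class Controller:
    """Toy controller that retries instead of owning the memory's schedule."""

    def read(self, mem: SelfManagingDRAM, row: int, max_retries: int = 3) -> str:
        for _ in range(max_retries + 1):
            reply = mem.request(row)
            if reply != "BUSY":
                return reply
        raise TimeoutError("memory stayed busy")

mem = SelfManagingDRAM()
ctrl = Controller()
replies = [ctrl.read(mem, r) for r in range(8)]
```

The design point is the inversion of control: the memory decides when refresh or RowHammer protection runs, and the controller's contract shrinks to "retry on BUSY" rather than micromanaging every timing parameter.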
Barriers to Adoption
- Economic: trillion-dollar investment in processor-centric infrastructure creates entrenched dogma.
- Standards: JEDEC committee (~390 companies) rarely converges on radical interface changes.
- Review processes: Mutlu's "Self-Managing DRAM" paper was rejected 6× over 3.5 years before acceptance; revolutionary ideas are dismissed as "commercially unattractive" or requiring full-stack co-design.
- Mindset gap: device-level physics (aging, spatial variation) remains poorly understood even after ~60 years of DRAM use; device-level and system-level research remain siloed.
Conclusion
The author frames this as a "Copernican Revolution" taking decades to realize. Sustainability and energy constraints make the status quo increasingly untenable. Treating memory as a combined computation + storage substrate yields orders-of-magnitude efficiency gains — if the industry can overcome systemic and economic resistance to radical architectural change.
Key Numbers
- 80–90% — time data-center processors spend waiting for memory
- 60% — system energy spent on data movement in Chrome, video codecs, TF inference
- 90% — system energy spent on memory/interconnect for LSTMs and transducers
- 800× — DRAM access energy vs 64-bit FP op
- 6,400× — DRAM access energy vs 32-bit int add
- 64,000× — storage+sensor+DRAM access vs actual computation
- 95% — hardware real estate dedicated to storing/moving data
- 7.8µs — current DRAM refresh cadence (CPU-imposed)
- 390 — approximate number of JEDEC member companies
- 6 rejections / 3.5 years — Self-Managing DRAM paper acceptance history
Related Work to Ingest Next
- Mutlu group (SAFARI) arXiv papers: RowClone, Ambit, Self-Managing DRAM (MICRO 2024, Yaglikci/Luo/Mutlu), RowPress (arXiv:2406.16153), related survey (arXiv:2503.16749)
- UPMEM / Qualcomm acquisition primary source ✓ confirmed June 2025 — CB Insights, PitchBook, design-reuse
- Samsung HBM-PIM (Aquabolt-XL) — Hot Chips 33 paper, Samsung Newsroom; validated on Xilinx Alveo at 2.5× perf / 60% energy cut
- SK Hynix AiM — GDDR6-AiM + AiMX 32 GB card; Hot Chips 2024 / AI HW Summit 2024; Llama 2 70B demo
- RowHammer paper — won 2024 Jean-Claude Laprie Award (dependable computing)
- JEDEC DDR5 / DDR6 activation counter specifications
- CXL 3.x composable memory specifications
- PIM-AI architectures survey (arXiv:2411.17309) — LLM inference-specific PIM designs
Verification Pass (2026-04-21)
Cross-checked key claims against public sources; evidence upgraded in the Processing-In-Memory concept page:
- ✅ UPMEM → Qualcomm acquisition confirmed (June 2025)
- ✅ Samsung Aquabolt-XL: 2.5× perf, 60% energy cut on Xilinx Alveo
- ✅ SK Hynix AiMX: 32 GB card, Llama 2 70B, 80% data-movement power saving
- ✅ RowPress: arXiv:2406.16153, commodity DDR4 bit flips
- ✅ Self-Managing DRAM: MICRO 2024, Yaglikci/Luo/Mutlu
- ✅ RowHammer paper: 2024 Jean-Claude Laprie Award
- ⚠️ 60–90% energy figures: still single-source (Mutlu framing of Google joint study); independent hyperscaler replication not yet located
- ⚠️ 800× / 6,400× / 64,000× DRAM vs compute energy ratios: widely cited but upstream source is Horowitz 2014 ISSCC keynote — need to verify whether figures have shifted meaningfully with modern nodes