AI hardware is having its Cambrian explosion moment. Every month brings a new “AI accelerator” slide deck: wafer-scale SRAM monsters, exotic analog cores, resistive in‑memory compute, near‑memory PIM stacks, and now “high-bandwidth flash.” On paper, they all promise to “break the memory wall” and “redefine AI compute.” In practice, they all end up in the same place: bolting more DRAM and SRAM onto the problem and hoping users won’t notice that the bottleneck hasn’t moved.
We call this pattern the DRAM Trap. It’s the structural reality behind today’s AI hardware: no matter how creative your compute fabric looks, once you scale to real LLM workloads, you are bound by the energy, bandwidth, and capacity limits of DRAM/SRAM. All the “new architectures” are just different points on the same trade-off curve.
SEMRON’s bet is simple: the only way out is to change the memory substrate itself. Our answer is 3D CapRAM – a capacitive in‑memory computing fabric integrated into 3D NAND-class stacks that pushes past the DRAM trap instead of negotiating slightly better terms inside it.
All Is DRAM/SRAM
Strip away the branding, and today’s AI accelerators collapse into one common pattern: a fast but small SRAM front-end plus a DRAM (usually HBM) back-end.
- Wafer-scale SRAM? As long as the model fits on the wafer, you win (as long as you are fine with paying for an entire wafer). The moment it doesn’t, you bolt on external DRAM and you are back on the wrong side of the wall.
- HBM GPUs and PIM-in-DRAM? Higher bandwidth DRAM is still DRAM. You pay the same activation/refresh penalties, the same energy per bit, just with more channels and heavy cooling.
- “Near-memory compute” and clever dataflows? They shuffle where you burn power on data movement, but they don’t eliminate per-weight digitization and shuttling.
Viewed through the right lens - energy per byte delivered vs arithmetic intensity (MACs per byte loaded) - these architectures all sit on a shared Pareto curve. SRAM buys low energy and low latency at the cost of atrocious density. HBM/HBM‑PIM sits in the messy middle. High-bandwidth flash pushes density further, but at a huge energy and latency tax. None of them escape the curve; they only pick different points on it.
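The lens above can be sketched as a simple amortization model, under stated assumptions: every byte fetched from memory costs a fixed energy, every MAC costs a fixed energy, and the fetch cost is spread over however many MACs reuse that byte. The function name and all numeric values below are illustrative placeholders, not measured figures for any device.

```python
def energy_per_mac(e_mac_pj: float, e_byte_pj: float, macs_per_byte: float) -> float:
    """Effective energy per MAC (pJ) when each fetched byte feeds macs_per_byte MACs."""
    if macs_per_byte <= 0:
        raise ValueError("macs_per_byte must be positive")
    # Compute energy plus the memory-fetch energy amortized over the reuse count.
    return e_mac_pj + e_byte_pj / macs_per_byte


# Placeholder values chosen only to show the shape of the curve (assumptions).
E_MAC_PJ = 1.0     # energy of one digital MAC
E_BYTE_PJ = 100.0  # energy to deliver one byte from off-chip memory

for intensity in (1, 10, 100, 1000):
    total = energy_per_mac(E_MAC_PJ, E_BYTE_PJ, intensity)
    print(f"{intensity:>5} MACs/byte -> {total:.2f} pJ per MAC")
```

In this toy model, data movement dominates at low arithmetic intensity and only fades once each byte is reused many times. That is why the same memory can look efficient in a high-reuse benchmark and poor in a low-reuse workload.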
Our position: as long as you stay on SRAM/DRAM/flash as digital storage feeding digital MACs, you are just tuning constants in the same broken equation.
The IMC Mirage: Analog Without Capacity
On paper, analog in‑memory computing (IMC) should have been the answer: compute where the weights live, avoid shuttling terabytes per second across a memory hierarchy, and ride physics to massive efficiency gains.
Instead, resistive IMC has largely stalled in the “cool demo, not deployable” phase:
- Noise, nonlinearity, and variability kill accuracy once you scale beyond toy networks.
- Energy-hungry ADCs often erase the theoretical gains.
- Most importantly: capacity. Almost every “real” IMC system ends up backed by… external DRAM or HBM. Again.
This is why the big players treat resistive-IMC startups as science experiments, not strategic partners: if your technology can’t carry a multi‑billion‑parameter LLM with realistic KV‑cache traffic, you’re not solving their actual problem.
Our view is blunt: IMC without 3D NAND‑class capacity is a dead end for large AI workloads. The physics may be interesting; the system story isn’t.
HBF: High-Bandwidth Flash Is Not the Escape Hatch
High-bandwidth flash (HBF) is often pitched as the missing piece: take 3D NAND, wrap it in a fat interface, call it “near memory,” and suddenly the capacity problem is solved.
But 3D NAND was never designed to be the main working memory of a decode-heavy LLM:
- Reads are slow and expensive in energy terms.
- Writes are limited in endurance and latency.
Even under optimistic assumptions, an HBF module is power-limited to only a few full readouts per second per unit area, which directly caps tokens/sec in low-arithmetic intensity regimes. That makes HBF fine for cold parameter storage and archival embedding stores - and fundamentally wrong for the hot path of inference.
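The power-limit argument can be sketched as back-of-the-envelope arithmetic: if reading one bit costs a fixed energy, the power budget caps how many complete passes over the stored data fit into one second. The function name and every number below are hypothetical placeholders for illustration, not specifications of any real HBF module. The per-unit-area version follows by using power density and capacity density instead of module totals.

```python
def full_readouts_per_second(power_budget_w: float,
                             read_energy_j_per_bit: float,
                             capacity_bytes: float) -> float:
    """Upper bound on complete passes over the stored data per second under a power cap."""
    energy_per_readout_j = read_energy_j_per_bit * capacity_bytes * 8
    return power_budget_w / energy_per_readout_j


# Hypothetical placeholder values (assumptions, not vendor data).
POWER_BUDGET_W = 10.0           # power available for reads
READ_ENERGY_J_PER_BIT = 10e-12  # energy per bit read
CAPACITY_BYTES = 64e9           # stored weights

readouts = full_readouts_per_second(POWER_BUDGET_W, READ_ENERGY_J_PER_BIT, CAPACITY_BYTES)
print(f"{readouts:.2f} full readouts per second")
```

If the weights fill the module and decode runs at batch size 1, each generated token needs roughly one full pass over the weights, so this readout rate is also a ceiling on tokens per second in that simplified setting.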
So you end up right back where everyone else does: flash for cold, DRAM/SRAM for hot. The DRAM trap, now with a more complicated diagram.
The Real Bottleneck: Decode, Not FLOPs
The industry still loves to talk in FLOPs and massive “theoretical throughput.” Our position paper argues that this is the wrong terminal metric.
For LLM inference, the only numbers that really matter are:
- Tokens per second per user (latency), and
- Tokens per dollar (dominated by energy and cooling, not silicon list price, when talking about datacenters).
Under realistic latency SLAs and batch sizes, the prefill phase is compute‑bound: photonics might help there someday, but memory architecture barely moves the needle. The real fight is in decode:
- Decode approaches batch size ≈ 1 in user-facing scenarios, as soon as context lengths grow.
- Arithmetic intensity is low: each weight byte only sees a handful of MACs before being evicted.
- Bandwidth and capacity - not FLOPs - are the hard constraints.
This applies to both deployment contexts:
- Datacenter: the optimization is “energy per token at fixed latency,” with power and cooling dominating TCO.
- Edge/mobile: area, silicon cost, and power envelope are hard constraints; DRAM/SRAM cannot scale to interesting models without blowing up BOM and thermals.
Across both, low arithmetic intensity is the norm, not the exception; on edge devices this holds even for prefill. That’s exactly where DRAM-based architectures are worst, and where CapRAM is designed to be best.
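The decode argument can be illustrated with a standard roofline-style estimate, sketched here under simplifying assumptions: every decode step streams all weights from memory once, performs about 2 FLOPs per parameter per sequence in the batch, and KV‑cache traffic is ignored. The function name and the hardware and model numbers are hypothetical placeholders, not figures for any specific accelerator.

```python
def decode_step_times(params: float, bytes_per_param: float, batch: int,
                      peak_flops: float, mem_bw_bytes_per_s: float):
    """Return (compute_s, memory_s) for one decode step in a simple roofline model."""
    compute_s = 2.0 * params * batch / peak_flops            # one multiply-add per weight per sequence
    memory_s = params * bytes_per_param / mem_bw_bytes_per_s  # weights streamed once per step
    return compute_s, memory_s


# Hypothetical placeholder configuration (assumptions for illustration only).
PARAMS = 7e9            # model parameters
BYTES_PER_PARAM = 2.0   # 16-bit weights
PEAK_FLOPS = 100e12     # peak compute throughput
MEM_BW = 1e12           # memory bandwidth in bytes/s

for batch in (1, 8, 64, 512):
    compute_s, memory_s = decode_step_times(PARAMS, BYTES_PER_PARAM, batch, PEAK_FLOPS, MEM_BW)
    step_s = max(compute_s, memory_s)  # the slower resource sets the step time
    bound = "memory" if memory_s >= compute_s else "compute"
    print(f"batch={batch:>3}: {1.0 / step_s:7.1f} tokens/s per user, {bound}-bound")
```

With these placeholder numbers, the step time at small batch sizes is set entirely by how fast the weights can be streamed, so adding peak FLOPs changes nothing until the batch is large enough to cross into the compute-bound region.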
CapRAM: Analog Compute With 3D NAND-Class Capacity
CapRAM starts from a different premise: you don’t fix the DRAM trap by trying to micro‑optimize digital data movement; you fix it by never moving the weights in the first place.
Technically, CapRAM is a memcapacitive device integrated into 3D NAND‑like vertical stacks that:
- Performs MACs in the analog domain using capacitors, not resistive elements.
- Achieves high accuracy and low noise, because capacitors are inherently more stable than RRAM/flash cells.
- Is compatible with the existing 3D NAND process flow with only minor modifications, inheriting its vertical density and cost advantages.
- Supports both non‑volatile (weights) and quasi‑volatile (KV‑cache) behavior in the same 3D fabric.
In practice, this means:
- SRAM‑like energy efficiency at 3D‑NAND‑class density. At low arithmetic intensity - the hard regime - CapRAM matches SRAM’s energy per MAC while exceeding HBF’s density.
- Bandwidth that scales with area without blowing up energy. Because weights never leave the array, “bandwidth” is really in‑place usage rate, not pins × GHz.
- Decode latency that doesn’t depend on weight reuse. DRAM/HBM architectures only look good when you can reuse weights 10–100×; under realistic reuse (often ≪ 256 MAC/byte), their effective energy and throughput collapse. CapRAM’s processing time per bit is essentially independent of reuse.
This is not “HBM but slightly better,” nor “flash with a nicer PHY.” It is a different substrate with a different scaling law.
Moving the Pareto Frontier, Not Just Sliding Along It
When you plot today’s digital memories — wafer‑scale SRAM, HBM, HBM‑PIM, HBF — in (parameter density, energy efficiency) space, you get a well-defined Pareto frontier. Every architecture that “innovates” by shuffling logic around DRAM and flash just moves along this curve.
CapRAM is off that curve:
- In density vs bandwidth-per-area, it sits near HBF on density but with dramatically higher memory‑limited throughput for realistic arithmetic intensities.
- In energy per bit vs density, it matches or beats SRAM’s efficiency while giving you one or more orders of magnitude more capacity per mm².
Concretely, in our modeling and end‑to‑end simulations:
- CapRAM delivers superior tokens-per‑Joule to HBM‑based and near‑memory designs across all evaluated LLMs.
- For multi‑billion‑parameter models and decode‑heavy workloads, it provides higher throughput at fixed power than both mobile‑class accelerators and DRAM‑PIM architectures.
- In the decode‑specialized regime - low reuse, KV‑cache dominated - CapRAM is orders of magnitude better in tokens/sec and energy/token than any DRAM/SRAM/HBF combination we examined.
This is the regime Ma & Patterson explicitly flag as the “hard open problem” for LLM hardware: bandwidth and interconnect latency dominate, and simply stacking more DRAM or wrapping NAND in a faster interface doesn’t fix it. CapRAM is designed to attack that exact problem, not route around it.
Why the Big Players Will Have to Move
Our thesis is not that CapRAM is a cute niche technology for specialized accelerators. It’s that the DRAM trap makes every existing roadmap fundamentally unsustainable:
- Datacenter: Power and cooling already dominate TCO for large‑scale inference. Doubling energy efficiency is often worth more than doubling raw throughput, because the former compounds across years of 24/7 operation. You don’t get there by eking out 10–20% from DRAM hierarchies.
- Edge/mobile: You can’t ship a phone, car, or XR device with a wafer‑scale SRAM or a rack of HBM. Density, BOM, and thermal envelopes are non‑negotiable. DRAM and on‑chip SRAM simply do not scale to the model sizes users now expect in these form factors.
At the same time, the NAND ecosystem is marching toward 300+ and eventually ~500 layers, with a mature industrial base pushing vertical integration forward regardless of what AI wants. CapRAM can ride that curve: every new 3D NAND process generation brings more capacity and bandwidth for us, without requiring a new manufacturing universe.
DRAM and SRAM, by contrast, are already in the “squeezing blood from a stone” phase: incremental tweaks, exotic packaging, and rising cost per useful bit.
We don’t think this is a fair fight.
From Alternative to Default
The core message of our position paper is blunt:
As long as AI accelerators are built on DRAM/SRAM/flash as dumb digital storage feeding digital MACs, they will remain trapped on a bandwidth–capacity trade-off curve that is fundamentally misaligned with LLM inference. CapRAM’s 3D capacitive fabric moves us to a different curve.
As models grow, contexts lengthen, and both cloud and edge budgets hit hard physical and economic limits, we expect the industry to converge on the same conclusion our analysis reaches:
- The “DRAM trap” is real and getting tighter.
- Resistive IMC without capacity, and HBF without a new compute model, are side quests, not solutions.
- 3D capacitive in‑memory compute with NAND‑class density is not a nice‑to‑have; it’s the next logical substrate for AI.
That is the paradigm we’re building at SEMRON.