Samsung Stuffs 1.2TFLOP AI Processor Into HBM2 to Boost Efficiency, Speed


Samsung has announced the availability of a new Aquabolt variation. Rather than delivering the typical clock speed jump or capacity improvement you’d expect, this new HBM-PIM can perform calculations directly on-chip that would otherwise be handled by an attached CPU, GPU, or FPGA.

PIM stands for Processor-in-Memory, and pulling it off is a noteworthy achievement for Samsung. Processors currently burn an enormous amount of power, and time, moving data from one location to another. The less time a CPU spends moving data (or waiting on another chip to deliver it), the more time it can spend performing computationally useful work.

CPU developers have worked around this problem for years by deploying various cache levels and integrating functionality that once lived in its own socket. Both FPUs and memory controllers were once mounted on the motherboard rather than directly integrated into the CPU. Chiplets actually work directly against this aggregation trend, which is why AMD has had to be careful that its Zen 2 and Zen 3 designs could boost overall performance while disaggregating the CPU die.

If bringing the CPU and memory closer together is good, building processing elements directly into memory would be even better. Historically, this has been difficult because logic and DRAM are typically built very differently. Samsung has apparently solved this problem, and it has leveraged the die-stacking capabilities of HBM to keep available memory density high enough to interest customers. Samsung claims it can deliver a more than 2x performance improvement while cutting power consumption by 70 percent, with no required hardware or software changes. The company expects validation to be complete by the end of the first half of this year.
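To put the two claimed figures in perspective, here is a rough back-of-the-envelope sketch. It assumes the 2x speedup and the 70 percent power reduction apply to the same workload at the same time, which is how the claim is worded but not something Samsung has detailed. The function name is ours, purely for illustration.

```python
# Back-of-the-envelope arithmetic on the claimed figures.
# Assumption: the speedup and the power reduction hold simultaneously
# for the same workload. Samsung says "more than 2x", so treat the
# results as a floor rather than an exact value.

def efficiency_gain(speedup: float, power_reduction: float) -> dict:
    """Return power, performance-per-watt, and energy per task relative to baseline."""
    relative_power = 1.0 - power_reduction       # fraction of baseline power
    perf_per_watt = speedup / relative_power     # work per joule vs. baseline
    energy_per_task = relative_power / speedup   # joules per task vs. baseline
    return {
        "relative_power": relative_power,
        "perf_per_watt": perf_per_watt,
        "energy_per_task": energy_per_task,
    }

claim = efficiency_gain(speedup=2.0, power_reduction=0.70)
for name, value in claim.items():
    print(f"{name}: {value:.2f}")
```

Under those assumptions, the claim works out to roughly 6.7x the performance per watt, or about 15 percent of the baseline energy for the same job.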

Image by THG

THG has some details about the new HBM-PIM solution, gleaned from Samsung’s ISSCC presentation this week. The new chip incorporates a Programmable Computing Unit (PCU) clocked at just 300MHz. The host controls the PCU via conventional memory commands and can use it to perform FP16 calculations directly in-DRAM. The HBM itself can operate either as normal RAM or in Function-in-Memory (FIM) mode.
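To make the offload idea concrete, here is a toy model, not Samsung’s actual interface. It compares the bytes that would cross the memory bus when a host sums a block of FP16 values itself against a scheme where each memory bank reduces its own slice and hands back a single partial sum. The bank count, the function names, and the idea that a partial result is one FP16 value are all hypothetical, and the model ignores command traffic entirely.

```python
# Toy data-movement model (all parameters hypothetical).
FP16_BYTES = 2  # an FP16 value occupies two bytes

def host_side_bytes(num_elements: int) -> int:
    """Host does the math: every element must be read across the bus."""
    return num_elements * FP16_BYTES

def in_memory_bytes(num_elements: int, num_banks: int) -> int:
    """Each bank holding data reduces locally and returns one partial sum."""
    active_banks = min(num_banks, num_elements)
    return active_banks * FP16_BYTES

elements = 1_000_000
banks = 16  # hypothetical bank count, chosen only for illustration

host = host_side_bytes(elements)
pim = in_memory_bytes(elements, banks)
print(f"host-side reduction moves {host} bytes")
print(f"in-memory reduction moves {pim} bytes")
print(f"traffic ratio: {host / pim:.0f}x")
```

Real workloads will not reduce this cleanly, but the sketch shows why keeping the arithmetic next to the data can cut bus traffic so sharply.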

Including the PCU reduces the total available memory capacity, which is why the FIMDRAM (that’s another term Samsung is using for this solution) only offers 6GB of capacity per stack instead of the 8GB you’d get with standard HBM2. All of the solutions shown are built on a 20nm DRAM process.

Image by THG

Samsung’s paper describes the design as “Function-In Memory DRAM (FIMDRAM) that integrates a 16-wide single-instruction multiple-data engine within the memory banks and that exploits bank-level parallelism to provide 4× higher processing bandwidth than an off-chip memory solution.”
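The quoted figures can be sanity-checked against the 1.2 TFLOPS in the headline. The sketch below works backward from the 300MHz clock and the 16-wide SIMD engine to the number of compute units the headline figure would imply. It assumes each lane completes one fused multiply-add per cycle, counted as two floating-point operations; that counting convention is our assumption, not something the article states.

```python
# Figures quoted in the article.
HEADLINE_FLOPS = 1.2e12   # 1.2 TFLOPS from the headline
PCU_CLOCK_HZ = 300e6      # 300MHz PCU clock
SIMD_LANES = 16           # 16-wide SIMD engine

# Assumption (not from the article): one fused multiply-add per lane
# per cycle, counted as 2 floating-point operations.
FLOPS_PER_LANE_PER_CYCLE = 2

def flops_per_unit(clock_hz: float, lanes: int, flops_per_lane: int) -> float:
    """Peak throughput of a single compute unit."""
    return clock_hz * lanes * flops_per_lane

def implied_unit_count(total_flops: float, per_unit: float) -> float:
    """How many units the headline throughput would require."""
    return total_flops / per_unit

per_unit = flops_per_unit(PCU_CLOCK_HZ, SIMD_LANES, FLOPS_PER_LANE_PER_CYCLE)
units = implied_unit_count(HEADLINE_FLOPS, per_unit)

print(f"per-unit peak: {per_unit / 1e9:.1f} GFLOPS")
print(f"implied unit count: {units:.0f}")
```

Under that assumption each unit peaks at 9.6 GFLOPS, so the headline implies about 125 units per stack. Since the headline figure is rounded, a power-of-two count such as 128 (about 1.23 TFLOPS) would fit just as well; the article itself does not give the count.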

Image by THG

One question Samsung hasn’t answered is how it deals with thermal dissipation, a key reason why it’s been historically difficult to build processing logic inside DRAM. This could be doubly difficult with HBM, in which each layer is stacked on top of another. The relatively low clock speed on the PIM may be a way of keeping DRAM cool.

We haven’t seen HBM deployed for CPUs much, Hades Canyon notwithstanding, but multiple high-end GPUs from Nvidia and AMD have tapped HBM/HBM2 as primary memory. It’s not clear if a conventional GPU would benefit from this offload capability, or how such a feature would be integrated with the GPU’s own impressive computational capacity. If Samsung can offer the performance and power improvements it claims to a range of customers, however, we’ll undoubtedly see this new HBM-PIM popping up in products a year or two from now. A 2x performance boost coupled with a 70 percent power consumption decrease is the kind of old-school improvement lithography node transitions used to deliver on a regular basis. It’s not clear if Samsung’s PIM will specifically catch on, but any promise of a classic full-node improvement will draw attention, if nothing else.
