Near Memory Computing in 3D ICs

Processing in Memory (PIM)

As the demand for high computing performance in artificial intelligence grows, parallel processing accelerators such as GPUs and TPUs have become the key determinants of system performance. Following this trend, the number of cores in accelerators continues to increase for performance scaling, which requires more off-chip memory bandwidth and more chip area. As a result, this not only increases the energy consumed by interconnects but also limits system performance through insufficient off-chip memory bandwidth. To overcome this limitation, the Processing-In-Memory (PIM) architecture has re-emerged. PIM integrates processing units with memory, and it can be implemented with 3D-stacked memory such as High Bandwidth Memory (HBM), which avoids the drawbacks caused by the difference between memory and processor fabrication processes.

Our lab's AI hardware group focuses on the optimized design of the PIM architecture and its interconnects for 3D-stacked PIM-HBM, considering signal integrity (SI) and power integrity (PI). To provide high memory bandwidth to the PIM core through through-silicon vias (TSVs), either the TSV data rate or the number of TSVs must be increased. However, TSVs already occupy more than 30% of the DRAM die area, and the achievable TSV data rate is determined by channel performance. Therefore, SI-aware optimization of TSV channel design and placement is essential for small area and high bandwidth. In addition, appropriate PIM cores must be embedded in the HBM logic die to improve system performance. As the number of PIM cores increases for higher performance, more logic die area is required, and the memory bandwidth available to the host processor decreases because the interposer channel becomes longer. Consequently, the PIM-HBM logic die and the interposer channel must be co-optimized so that system performance improves without degrading the interposer bandwidth to the host processor.
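As a rough illustration of this trade-off, the sketch below estimates aggregate TSV bandwidth and TSV array area. The TSV count, per-TSV data rate, and pitch are assumed example values for illustration only, not parameters of our design.

```python
# Minimal sketch of the TSV bandwidth/area trade-off.
# All numbers are assumed example values, not actual design parameters.

def tsv_bandwidth_gbs(num_tsv: int, data_rate_gbps: float) -> float:
    """Aggregate TSV bandwidth in GB/s for a given TSV count and per-TSV data rate."""
    return num_tsv * data_rate_gbps / 8.0  # Gb/s -> GB/s

def tsv_area_mm2(num_tsv: int, pitch_um: float) -> float:
    """Approximate silicon area consumed by the TSV array (square-grid assumption)."""
    return num_tsv * (pitch_um * 1e-3) ** 2  # one pitch x pitch cell per TSV

# Doubling the TSV count doubles both bandwidth and area, while raising
# the per-TSV data rate costs no area but is limited by channel SI.
for num_tsv, rate in [(1024, 2.0), (2048, 2.0), (1024, 4.0)]:
    print(f"{num_tsv} TSVs @ {rate} Gb/s -> "
          f"{tsv_bandwidth_gbs(num_tsv, rate):.0f} GB/s, "
          f"{tsv_area_mm2(num_tsv, pitch_um=40):.2f} mm^2")
```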

Through the system-level optimization described above, our PIM-HBM architecture achieves high energy efficiency by drastically reducing interconnect lengths, and it improves system performance in memory-bound applications by increasing internal TSV bandwidth.
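The energy argument can be seen with a first-order wire-energy model, E_bit ≈ C_wire · V², where the wire capacitance C_wire scales with channel length. The capacitance, voltage, and length values below are assumed purely for illustration.

```python
# First-order interconnect energy model: E_bit ~ C_wire * V^2, with C_wire
# proportional to channel length. Values are assumed illustrative numbers,
# not extracted from a real PIM-HBM design.

def energy_per_bit_pj(length_mm: float, cap_pf_per_mm: float, vdd: float) -> float:
    """Dynamic energy per bit (pJ) for a wire of the given length."""
    return cap_pf_per_mm * length_mm * vdd ** 2

host_path = energy_per_bit_pj(length_mm=5.0, cap_pf_per_mm=0.2, vdd=1.2)  # interposer to host
pim_path = energy_per_bit_pj(length_mm=0.1, cap_pf_per_mm=0.2, vdd=1.2)   # short TSV to PIM core

print(f"host path: {host_path:.3f} pJ/bit")
print(f"PIM path : {pim_path:.3f} pJ/bit ({host_path / pim_path:.0f}x lower)")
```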

Fig. 1 Conceptual view of 3D stacked PIM architecture based on HBM (PIM-HBM)

Fig. 2 Floorplan of heterogeneous PIM-HBM architecture

Neuromorphic Chip

The most important part of artificial intelligence computation is massively parallel vector-matrix multiplication. This operation is inefficient in both computation time and power on the conventional von Neumann architecture, because the operands must be fetched from off-chip memory every clock cycle, which consumes a great deal of interconnect power. Various AI hardware architectures are therefore emerging to solve this problem. Among them, an architecture attracting much attention is one that computes and stores simultaneously in crossbar form using ReRAM. This architecture reduces off-chip memory accesses for data fetch and computes vector-matrix multiplication directly by reading the currents produced by the applied voltages and the ReRAM resistances. Thus, AI computation can be performed very efficiently in hardware.
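The in-place computation follows from Ohm's law and Kirchhoff's current law: with input voltages V_i applied to the rows and conductances G_ij stored in the cells, each column current is I_j = Σ_i V_i · G_ij. Below is a minimal sketch of this ideal crossbar model; the array size and device ranges are assumed, and non-idealities such as IR drop and device variation are ignored.

```python
import numpy as np

# Ideal crossbar model: weights are stored as ReRAM conductances G,
# inputs are applied as row voltages V, and each column current is
# I_j = sum_i V_i * G_ij (Ohm's law + Kirchhoff's current law).

rng = np.random.default_rng(0)
rows, cols = 128, 64                        # assumed example array size
G = rng.uniform(1e-6, 1e-4, (rows, cols))   # conductances in siemens (1/R)
V = rng.uniform(0.0, 0.2, rows)             # small read voltages to avoid disturb

I = V @ G                                   # column currents = vector-matrix product
print(I.shape)                              # (64,): one analog MAC result per column
```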

Our lab's AI hardware group focuses on the design of optimized computing architectures and interconnects, considering signal integrity (SI) and power integrity (PI), for accurate hardware operation. In general, a ReRAM crossbar array is smaller than the number of input neurons in a layer, but a large crossbar array suffers from large IR drop and interconnect energy, so we optimize the array size by weighing these competing factors. We then analyze SI/PI issues such as crosstalk noise between crossbar wires and power/ground noise, which can disturb the ReRAM resistance state and high-speed computation. Because the read voltage margin is small, such noise can cause serious errors in the computation. Finally, we propose a design guide for ReRAM crossbar arrays for hardware AI operation.
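To give a feel for the IR-drop effect on read margin, the sketch below makes a first-order estimate of the effective read voltage along one crossbar row. The segment resistance and cell conductances are assumed illustrative values, and the coupled nonlinear network is not solved exactly: cell currents are computed at the ideal read voltage, then the cumulative drop across wire segments gives the voltage each cell actually sees.

```python
import numpy as np

# First-order IR-drop estimate along one crossbar row.
# r_seg and g_cell are assumed illustrative values, not measured data.

n_cols = 256
r_seg = 0.1                           # wire resistance per segment (ohm), assumed
g_cell = np.full(n_cols, 5e-5)        # cell conductance (S), assumed uniform
v_read = 0.2                          # ideal read voltage (V)

i_cell = v_read * g_cell                      # per-cell current at ideal voltage
i_seg = np.cumsum(i_cell[::-1])[::-1]         # segment k carries all downstream current
v_eff = v_read - r_seg * np.cumsum(i_seg)     # effective voltage at each cell

print(f"near cell: {v_eff[0]*1e3:.1f} mV, far cell: {v_eff[-1]*1e3:.1f} mV")
# A larger array lengthens the row, increasing the drop and eroding read margin.
```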