

# **Compute-In-Memory Design For AI**

**Bonan Yan**, Peking University
June 16, 2025









## **Outline**

- Introduction of The Focused Paper
  - A 1.041Mb/mm<sup>2</sup> 27.38TOPS/W Signed-INT8 Dynamic-Logic-Based ADC-Less SRAM Compute-In-Memory Macro in 28nm with Reconfigurable Bitwise Operation for AI and Embedded Applications [ISSCC'2022]
- Our Subsequent Related Works
  - SRAM CIM [TCAS-I'2024, HPCA'2025]
  - RRAM CIM [Nature Electronics'2024]
  - Near-Memory Accelerator for Reinforcement Learning [DATE'2025]

## **Motivation**



## Challenges

Challenge 1: Large Compute Circuit Area

Challenge 2: Limited CIM Function

Challenge 3: CIM to DNN Deploy



**Need Smaller Compute Circuits & CIM Architecture** 

Challenge 2: Limited CIM Function

### Conventional:



Need reconfigurable algorithmic/logic operations, e.g. XOR, OR, IMPLY.

Challenge 3: CIM to DNN Deploy



Direct mapping has low utilization



A low spatial utilization rate of the weights causes low average energy efficiency.



## **Key Contributions**

Challenge 1: Large compute circuit area



ADC-Less Vector-Hadamard-Product (VHP) Architecture



Vector-Matrix Multiply (VMM):

 $\begin{bmatrix} X_0 & \cdots & X_m \end{bmatrix} \begin{bmatrix} W_{0,0} & \cdots & W_{0,n} \\ \vdots & \ddots & \vdots \\ W_{m,0} & \cdots & W_{m,n} \end{bmatrix} = \begin{bmatrix} \sum_i X_i \cdot W_{i,0} & \cdots & \sum_i X_i \cdot W_{i,n} \end{bmatrix}$ 

Challenge 2: Limited CIM function



Reconfigurable Local Process Units (RLPU) + Dynamic Logic Compute Circuits (DCC)



Challenge 3: CIM to DNN deploy



Flexible VHP/VMM Support With Post Sum Circuits










## **Compartment-Based Macro Organization**

Separate ports to support Vector Hadamard Product (VHP) operations:



## **Macro VHP Architecture**

Compartment-based vector element selection

Effective vector-matrix multiplication (VMM)
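As a rough illustration of how a Hadamard-product primitive yields a VMM, the sketch below forms the element-wise product of the input vector with one weight column and then sums it. The function names and the list-of-rows weight layout are assumptions made for this example and do not describe the macro's actual port or compartment wiring.

```python
def hadamard(x, w_col):
    """Element-wise (Hadamard) product of an input vector and one weight column."""
    return [xi * wi for xi, wi in zip(x, w_col)]

def vmm_from_vhp(x, W):
    """Vector-matrix multiply built from Hadamard products.

    W is a list of rows, W[i][j]. Output element j is sum_i x[i] * W[i][j].
    """
    n_cols = len(W[0])
    out = []
    for j in range(n_cols):
        col = [row[j] for row in W]          # weights that meet input vector x
        out.append(sum(hadamard(x, col)))    # summing the products gives one VMM output
    return out

x = [1, 2, 3]
W = [[1, 0],
     [0, 1],
     [2, 2]]
print(hadamard(x, [1, 0, 2]))  # Hadamard product with column 0
print(vmm_from_vhp(x, W))      # VMM result
```

Keeping the unsummed products available is what allows the same hardware to expose both VHP and VMM results.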



## **Key Contributions**

Reconfigurable Local Process Units (RLPU) + Dynamic Logic Compute Circuits (DCC): reconfigurable to AND, OR, XOR



# **Principle of RLPU+DCC**

#### **Inside a compartment**

- WL selects 1 out of 16 per compartment; each bit of a weight (W<sub>i,j</sub>[7] ... W<sub>i,j</sub>[0]) is stored in a column of 6T SRAM bitcells with bit lines BLP/BLN.
- Each SRAM column is read out through a DCC column that produces the dynamic logic outputs DLO[15:0].
- Single-bit RLPU: Input Combinatory Logic (ICL) generates INP/INN from the input data, and the DCC combines them with the bitcell data on BLP/BLN.
- Reconfigure through ICL.
- RLPU: Reconfigurable Local Process Units
- DCC: Dynamic logic Compute Circuits
- Available bitwise logic options: AND, OR, XOR

# **Principle of RLPU+DCC**

Take RLPU configured to AND as an example










# **RLPU Reconfigurability**

### All possible bitwise configurations

Dynamic Logic Output =  $\overline{INP \cdot w + INN \cdot \overline{w}}$ 

| RLPU logic | Config via ICL | IN | INP | INN | w | $\overline{w}$ | Dynamic logic output DLO[k] | RLPU logic result Out[k] | Out[k] relative to DLO[k] | Applications |
|---|---|---|---|---|---|---|---|---|---|---|
| AND (multiply) | INP=IN, INN=0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | inverted | Element multiplication: vector-matrix multiplication, Hadamard product |
| | | 1 | 1 | 0 | 0 | 1 | 1 | 0 | | |
| | | 0 | 0 | 0 | 1 | 0 | 1 | 0 | | |
| | | 1 | 1 | 0 | 1 | 0 | 0 | 1 | | |
| OR | INP=0, INN=$\overline{IN}$ | 0 | 0 | 1 | 0 | 1 | 0 | 0 | not inverted | CIM data mask |
| | | 1 | 0 | 0 | 0 | 1 | 1 | 1 | | |
| | | 0 | 0 | 1 | 1 | 0 | 1 | 1 | | |
| | | 1 | 0 | 0 | 1 | 0 | 1 | 1 | | |
| XOR | INP=IN, INN=$\overline{IN}$ | 0 | 0 | 1 | 0 | 1 | 0 | 0 | not inverted | Hamming distance computation |
| | | 1 | 1 | 0 | 0 | 1 | 1 | 1 | | |
| | | 0 | 0 | 1 | 1 | 0 | 1 | 1 | | |
| | | 1 | 1 | 0 | 1 | 0 | 0 | 0 | | |
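The truth table can be checked with a small behavioral model of the formula DLO = NOT(INP·w + INN·NOT(w)). This is only a sketch: the function names are made up for illustration, and the ICL settings (in particular INN taken as the complement of IN for OR and XOR) are read off the INP/INN columns of the table rather than from a circuit description.

```python
def icl(mode, x):
    """Input Combinatory Logic: map input bit x to (INP, INN, invert_output)."""
    nx = 1 - x
    if mode == "AND":
        return x, 0, True      # Out is the inverted DLO
    if mode == "OR":
        return 0, nx, False    # Out is DLO itself
    if mode == "XOR":
        return x, nx, False    # Out is DLO itself
    raise ValueError(f"unknown mode: {mode}")

def rlpu(mode, x, w):
    """Single-bit RLPU: evaluate the dynamic logic, then optionally invert."""
    inp, inn, invert = icl(mode, x)
    dlo = 1 - ((inp & w) | (inn & (1 - w)))   # DLO = NOT(INP*w + INN*NOT(w))
    return 1 - dlo if invert else dlo

for mode in ("AND", "OR", "XOR"):
    rows = [(x, w, rlpu(mode, x, w)) for w in (0, 1) for x in (0, 1)]
    print(mode, rows)
```

Because only the ICL mapping changes between modes, the same dynamic-logic stage serves all three bitwise operations.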

## **DCC Characteristics**



#### Possible Weakness??



[1] X. S., et al., ISSCC'2020

### **DCC Measurement**



Common compute time is <10ns

## **Entire Architecture**

### Novelty:

- Compartment-based organization
- VHP architecture
- DCC+RLPU circuits










## **Introduction of Post-Sum Circuits**



## **Post-Sum Circuits**



- Shift & Add: Combine Bit-Serial Inputs
- Adder Tree: Convert Hadamard Products Into VMM Results

We Have A Little Trick Here: Arrange A ∑9 Here

∑k: sum of k numbers
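The two post-sum steps can be sketched in a few lines. The example assumes unsigned inputs fed LSB first and uses a plain pairwise reduction as the adder tree; the handling of signed INT8 inputs and the real tree structure are not described on the slide, so all names and simplifications here are illustrative only.

```python
def bit_serial_multiply(x, w, input_bits=8):
    """Multiply unsigned input x by weight w, one input bit per cycle (LSB first)."""
    acc = 0
    for k in range(input_bits):
        x_bit = (x >> k) & 1
        partial = w if x_bit else 0   # AND of the input bit with the stored weight
        acc += partial << k           # shift & add combines the bit-serial results
    return acc

def adder_tree(values):
    """Pairwise reduction of Hadamard-product elements into one sum."""
    vals = list(values)
    while len(vals) > 1:
        if len(vals) % 2:
            vals.append(0)            # pad odd-length levels with a zero
        vals = [vals[i] + vals[i + 1] for i in range(0, len(vals), 2)]
    return vals[0] if vals else 0

inputs = [13, 200, 7]
weights = [-7, 3, 5]
products = [bit_serial_multiply(x, w) for x, w in zip(inputs, weights)]
print(products)              # element-wise (Hadamard) products
print(adder_tree(products))  # one VMM output element
```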

# Why ∑k?






# How ∑k?



- ∑4: for 2\*2 Conv Kernel
- ∑9: for 3\*3 Conv Kernel

Three groups of ∑9 and one ∑4 fit in a 32-element output vector (3 × 9 + 4 = 31 ≤ 32).

Why "32": our design has 32 compartments.

The additional ∑4 result improves the utilization rate.
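A small sketch of the grouping arithmetic: consecutive slices of the 32-element product vector are summed as ∑9, ∑9, ∑9 and ∑4. The slot-occupancy ratio computed here is a simplified stand-in for the utilization rate, and the function names and slot ordering are assumptions for illustration.

```python
def grouped_sums(products, groups=(9, 9, 9, 4)):
    """Sum consecutive slices of a product vector; leftover slots stay unused."""
    if sum(groups) > len(products):
        raise ValueError("groups do not fit in the product vector")
    out, start = [], 0
    for k in groups:
        out.append(sum(products[start:start + k]))   # one sum-of-k result
        start += k
    return out

def slot_utilization(groups, n_compartments=32):
    """Fraction of compartment outputs that feed some sum-of-k group."""
    return sum(groups) / n_compartments

products = list(range(32))            # stand-in for 32 Hadamard-product elements
print(grouped_sums(products))
print(slot_utilization((9, 9, 9)))    # three 3*3 kernels only
print(slot_utilization((9, 9, 9, 4))) # plus one 2*2 kernel
```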

## ∑k Benefits



- Increase Utilization Rate
  - (Regular Shaped) CIM Macro is Naturally Good At Dense (Fully-Connected) Layer
    - ∑9 is Added Specially for Ubiquitous 3\*3 Conv Kernel






### Measurement

#### Measured Power Breakdown



### **Testing Configuration:**

- Average values w/ precision (input, weight)=(8b,8b)
- Benchmarked by conv & fc layers of quantized ResNet-34 & MobileNet, data reload not included



# Comparison

### **On Form Factor**

| Work | ISSCC'18 [1] | ISSCC'20 [2] | JSSC'21 [3] | ISSCC'21 [4] | ESSCIRC'19 [5] | ISSCC'21 [6] | This Work |
|---|---|---|---|---|---|---|---|
| CIM Macro Type | Analog CIM | Analog CIM | Analog CIM | Analog CIM | Digital CIM | Digital CIM | Digital CIM |
| Technology | 65nm | 28nm | 7nm | 28nm | 65nm | 22nm | 28nm |
| Array Size | 4Kb | 64Kb | 4Kb | 384Kb | 16Kb | 64Kb | 32Kb |
| Cell Type | S6T | 6T | 8T | 6T | 6T | 6T | 6T |
| Macro Area | N/A | 0.362mm<sup>2</sup> | 0.0032mm<sup>2</sup> | 1.4mm<sup>2</sup> <sup>†</sup> | 0.2272mm<sup>2</sup> | 0.202mm<sup>2</sup> | 0.030mm<sup>2</sup> |
| CIM Weight Density | N/A | 177Kb/mm<sup>2</sup> @28nm | 1250Kb/mm<sup>2</sup> @7nm | 234Kb/mm<sup>2</sup> @28nm | 71Kb/mm<sup>2</sup> @65nm | 317Kb/mm<sup>2</sup> @22nm | 1067Kb/mm<sup>2</sup> @28nm |
| CIM Weight Density (normalized to 28nm) | N/A | 177Kb/mm<sup>2</sup> | 78Kb/mm<sup>2</sup> | 234Kb/mm<sup>2</sup> | 383Kb/mm<sup>2</sup> | 196Kb/mm<sup>2</sup> | 1067Kb/mm<sup>2</sup> |
| Power Supply | 1V, 0.8V | 0.7V-0.9V | 1V, 0.8V | 0.7V-0.9V | 0.6V-0.8V | 0.72V | 0.8V |

<sup>†</sup> Estimated from [4]
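The normalized densities in the table are consistent with scaling each density by the square of the feature-size ratio. The slide does not state the normalization rule, so the sketch below treats quadratic area scaling as an assumption; the function name is illustrative.

```python
def normalize_density(density_kb_mm2, node_nm, target_nm=28):
    """Scale a weight density to a target node, assuming area scales with node squared."""
    return density_kb_mm2 * (node_nm / target_nm) ** 2

# (reported density in Kb/mm^2, technology node in nm) taken from the table
for label, density, node in [("JSSC'21 [3]", 1250, 7),
                             ("ESSCIRC'19 [5]", 71, 65),
                             ("ISSCC'21 [6]", 317, 22)]:
    print(label, round(normalize_density(density, node)))
```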

### **On Function Diversity**

| Work | ISSCC'18 [1] | ISSCC'20 [2] | JSSC'21 [3] | ISSCC'21 [4] | ESSCIRC'19 [5] | ISSCC'21 [6] | This Work |
|---|---|---|---|---|---|---|---|
| CIM Macro Type | Analog CIM | Analog CIM | Analog CIM | Analog CIM | Digital CIM | Digital CIM | Digital CIM |
| Technology | 65nm | 28nm | 7nm | 28nm | 65nm | 22nm | 28nm |
| Compute Circuits | CMI-VSA | LMAR-SAR-ADC | Flash ADC | Ph-ADC | CMOS Static Logic | CMOS Static Logic | Dynamic Logic Compute Circuit |
| Supported Bitwise Operations | AND | AND | AND | AND | AND | AND | AND, XOR, OR |
| Fundamental Vector Operation | VMM | VMM | VMM | VMM | VMM | VMM | Hadamard Product, VMM |

### **On Computation & Efficiency**

| Work | ISSCC'18 [1] | ISSCC'20 [2] | JSSC'21 [3] | ISSCC'21 [4] | ESSCIRC'19 [5] | ISSCC'21 [6] | This Work |
|---|---|---|---|---|---|---|---|
| CIM Macro Type | Analog CIM | Analog CIM | Analog CIM | Analog CIM | Digital CIM | Digital CIM | Digital CIM |
| Input Bits | 1 | 4b/8b | 4b | 4b/8b | 1b-16b | 1b-8b | 1b-8b |
| Weight Bits | 1 | 4b/8b | 4b | 4b/8b | 4b/8b/12b/16b | 4b/8b/12b/16b | 1b/4b/8b |
| Output Bits | 1 | 12b (4b/4b)<br>20b (8b/8b) | 4b | 12b (4b/4b)<br>20b (8b/8b) | 8b-23b | 16b (4b/4b)<br>24b (8b/8b) | VMM: 21b<br>VHP: 8b<br>(8b/8b) |
| Cycle Time | 2.3ns | 4.1ns (4b/4b)<br>8.4ns (8b/8b) | 4.5ns (1.0V)<br>5.5ns (0.8V) | 4ns (4b/4b)<br>7.2ns (8b/8b) | N/A | 10ns (4b/4b)<br>18ns (8b/8b) | 3ns/input bit |
| Energy Efficiency<sup>\*\*</sup> (TOPS/W) | 55.8 (1b/1b) | 58.1 (4b/4b)<br>14.1 (8b/8b) | 351 (1b/1b) | 77.3 (4b/4b)<br>18.9 (8b/8b) | 117.3 (1b/1b) | 24.7 (8b/8b) | 27.38 (8b/8b) |
| CIM Weight Density (normalized to 28nm) | N/A | 177Kb/mm<sup>2</sup> | 78Kb/mm<sup>2</sup> | 234Kb/mm<sup>2</sup> | 383Kb/mm<sup>2</sup> | 196Kb/mm<sup>2</sup> | 1067Kb/mm<sup>2</sup> |
| SWaP (FoM)<sup>\*\*\*</sup> (TOPS/W·Mb/mm<sup>2</sup>) | N/A | 161 | 26.7 | 283 | 8.13 | 303 | 1826 |

<sup>\*\*</sup> Average energy efficiency, no sparsity employed.

<sup>\*\*\*</sup> The figure of merit (FoM) SWaP (space, wattage and performance) = (Weight Density × Performance) / Power = Weight Density × Energy Efficiency.
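A worked calculation of the FoM, as a sketch under two assumptions that the footnote does not spell out: the density is converted to Mb/mm² with 1 Mb = 1024 Kb, and the energy efficiency is scaled to a 1b×1b equivalent by multiplying by input bits × weight bits. With these assumptions the "This Work", ISSCC'21 [4] and ISSCC'21 [6] entries are reproduced; the rule is not confirmed by the source and may not hold for every column.

```python
def swap_fom(density_kb_mm2, tops_per_w, input_bits, weight_bits):
    """SWaP FoM = weight density (Mb/mm^2) x energy efficiency (1b-equivalent TOPS/W)."""
    density_mb_mm2 = density_kb_mm2 / 1024                     # assumed: 1 Mb = 1024 Kb
    efficiency_1b = tops_per_w * input_bits * weight_bits      # assumed bit-width scaling
    return density_mb_mm2 * efficiency_1b

print(round(swap_fom(1067, 27.38, 8, 8)))  # This Work, 8b/8b
print(round(swap_fom(196, 24.7, 8, 8)))    # ISSCC'21 [6], 8b/8b
print(round(swap_fom(234, 77.3, 4, 4)))    # ISSCC'21 [4], 4b/4b
```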

## **Summary For This Digital SRAM CIM Design**



- VHP Organization Architecture
  - Increase Flexibility
- RLPU+DCC Design
  - Simplified ADC-Less CIM Operation
- ∑k Technique for CIM Macro Post-Sum Circuits
  - Increase Conv Kernel Utilization Rate

Chip Demo Video:

https://www.bilibili.com/video/BV15b4y1H7gP

## Subsequent Works: PIM SoC Open-Source Project

PIM-SoC Design: (graduate course project @ PKU)

- PIM Macros (Verilog Behaviors)
- Programmable Interface



- Simulate Complex Dataflows w/ instructions

Available at: https://github.com/rw999creator/gpp-pim

Paper: https://arxiv.org/abs/2411.13054




## **SRAM CIM For Markov Chain Monte Carlo Sampler**


### **Emerging Embodied AI "Think&Estimate" Applications**



#### **Our Solution: PROCA Architecture**



- Markov Chain Monte Carlo (MCMC) Acceleration
- Speedup 172~4871× vs. Intel Xeon Gold CPU
- Speedup 42~1058× versus NVIDIA A100 GPU
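For readers unfamiliar with the workload, the accept/reject loop at the heart of MCMC can be sketched with a generic random-walk Metropolis-Hastings sampler. This is a textbook software version for illustration only; it is not the PROCA design, and the function name, proposal distribution and parameters are assumptions.

```python
import math
import random

def metropolis_hastings(log_prob, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings for a 1-D target given by its log density."""
    rng = random.Random(seed)
    x, lp = x0, log_prob(x0)
    samples, accepted = [], 0
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)            # symmetric Gaussian proposal
        lp_new = log_prob(proposal)
        # Accept with probability min(1, p(proposal) / p(x)); otherwise keep x.
        if rng.random() < math.exp(min(0.0, lp_new - lp)):
            x, lp = proposal, lp_new
            accepted += 1
        samples.append(x)
    return samples, accepted / n_samples

# Target: standard normal, specified up to a constant by its log density.
samples, accept_rate = metropolis_hastings(lambda v: -0.5 * v * v, 0.0, 20000)
print(len(samples), round(accept_rate, 2))
```

Each iteration depends on the previous state and on a data-dependent accept/reject decision, which is the kind of sequential, branch-heavy pattern that CPUs and GPUs handle poorly.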

#### Deep Learning (DL)

DenseNet, EfficientNet, ResNet, LLaMA, LLaMA 2, Mixtral, GPT-3, GPT-4, MobileNet, ShuffleNet, RegNet, SqueezeNet, ViT, Transformer, MNASNet, GoogLeNet, ...

#### **Probabilistic AI**

Bayesian Regression, VAE, Hidden Markov Models, Normalizing Flows, Deep Markov Model, Causal Effect VAE, Gaussian Mixture Model, Gibbs Sampling, Metropolis-Hastings, Probabilistic Circuits, ...







Lack Chip Infrastructure!

Yihan Fu, et al., TCAS-I'2024; Yihan Fu, et al., HPCA'2025

## **RRAM CIM Based Universal Ising Machine**

NP-Hard Combinatorial **Optimization Problems**

Deploy to Ising Graphs



$$H_P = \sum_{i,j=1}^N J_{ij} \sigma_i \sigma_j + \sum_{i=1}^N h_i \sigma_i$$

Simulated Annealing Methods to Find Optimal Solutions
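A minimal software sketch of simulated annealing on the Hamiltonian above, using the same sign convention (H is minimized). Single-spin flips, a geometric cooling schedule and all parameter values are assumptions chosen for illustration; they do not describe the RRAM chip's annealing procedure.

```python
import math
import random

def ising_energy(J, h, s):
    """H = sum_{i,j} J[i][j]*s[i]*s[j] + sum_i h[i]*s[i], with spins s[i] in {-1, +1}."""
    n = len(s)
    pair = sum(J[i][j] * s[i] * s[j] for i in range(n) for j in range(n))
    field = sum(h[i] * s[i] for i in range(n))
    return pair + field

def simulated_annealing(J, h, n_steps=2000, t_start=5.0, t_end=0.05, seed=0):
    """Single-spin-flip annealing; returns the lowest-energy state seen."""
    rng = random.Random(seed)
    n = len(h)
    s = [rng.choice((-1, 1)) for _ in range(n)]
    e = ising_energy(J, h, s)
    best_s, best_e = s[:], e
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(1, n_steps - 1))
        i = rng.randrange(n)
        s[i] = -s[i]                                   # propose flipping one spin
        e_new = ising_energy(J, h, s)
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            e = e_new                                  # accept the flip
            if e < best_e:
                best_s, best_e = s[:], e
        else:
            s[i] = -s[i]                               # reject: undo the flip
    return best_s, best_e

# Toy 4-spin problem: positive couplings on a ring penalize equal neighbors.
J = [[0, 1, 0, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [1, 0, 1, 0]]
h = [0, 0, 0, 0]
state, energy = simulated_annealing(J, h)
print(state, energy)
```

Recomputing the full energy on every flip keeps the sketch short; a practical solver would update only the local energy change.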



Graph Coloring in EDA (layout decomposition)



Develop RRAM-PIM Chip for Universal Ising Machine







# **Near-Memory Computing Accelerator for RL**

- PEARL: FPGA-based Reinforcement Learning Environment Accelerator
- https://github.com/Selinaee/FPGA Gym



Challenge: GPUs can accelerate policy computation, but **NOT** environment updates.

Introduce near-memory pipelining to accelerate RL rollout.

## More Ways to Know Us: bonany.cc

1. [HPCA'25] Fu, Yihan, et al. "PROCA: Programmable Probabilistic Processing Unit Architecture with Accept/Reject Prediction & Multicore Pipelining for Causal Inference."
2. [DATE'25] Li, Jiayi, et al. "PEARL: FPGA-Based Reinforcement Learning Acceleration with Pipelined Parallel Environments."
3. [Nature Communications'25] Yue, Wenshuo, et al. "Physical Unclonable In-Memory Computing for Simultaneous Protecting Private Data and Deep Learning Models."
4. [Nature Electronics'24] Yue, Wenshuo, et al. "A Scalable Universal Ising Machine Based on Interaction-Centric Storage and Compute-In-Memory." 7, 904–913 (2024).
5. [IEEE TCAS-I'24] Fu, Yihan, et al. "Probabilistic Compute-in-Memory Design for Efficient Markov Chain Monte Carlo Sampling." vol. 71, no. 2, pp. 703-716, 2024.
6. [ISSCC'22] Yan, Bonan\*, et al. "A 1.041-Mb/mm<sup>2</sup> 27.38-TOPS/W Signed-INT8 Dynamic-Logic-Based ADC-Less SRAM Compute-in-Memory Macro in 28nm with Reconfigurable Bitwise Operation for AI and Embedded Applications." 2022.

- (1) SRAM-PIM Chip Function Demo
- (2) FPGA-Gym for RL Framework





### Acknowledgement
















