

# A 28nm 83.23TFLOPS/W POSIT-Based Compute-in-Memory Macro for High-Accuracy AI Applications

Yang Wang<sup>1</sup>, Xiaolong Yang<sup>1</sup>, <u>Yubin Qin<sup>1</sup></u>, Zhiren Zhao<sup>1</sup>, Ruiqi Guo<sup>1</sup>, Zhiheng Yue<sup>1</sup>, Huiming Han<sup>1</sup>, Shaojun Wei<sup>1</sup>, Yang Hu<sup>1</sup>, Shouyi Yin<sup>1,2</sup>

<sup>1</sup>Tsinghua University, Beijing, China <sup>2</sup>Shanghai AI Laboratory, Shanghai, China





- Background and Motivation
- Challenges of POSIT-Based CIM Macro
- Proposed POSIT@CIM Macro Features
  - Bi-directional Regime Processing Codec
  - Critical-bit Pre-compute-and-store CIM Array
  - Cyclically-alternating Scheduling Adder Tree
- Measurement and Comparison
- Conclusion

## FP-CIM for High-accuracy AI Applications



- Recent AI tasks are becoming increasingly complex.
- Complex AI applications require FP-CIM for high accuracy.

#### **Limitation of Conventional FP Data Format**



Conventional FP cannot achieve high accuracy with low power.

#### **Principle of POSIT Data Format**



POSIT exploits dynamic bit-widths to adapt to varied distributions.

#### **Conventional FP vs. POSIT**



■ Compared with FP16, POSIT8 saves 27% energy with 0.4% accuracy loss.


## Challenge 1: Large Power in Regime Processing





Dynamic regime increases pre-processing energy by 2.62×.

## Challenge 2: Cell Under-utilization in CIM Array



Dynamic mantissa introduces 41.3% CIM cell underutilization.

## Challenge 3: Redundant Toggle in Adder Tree





2025 MiM Webinar

Redundant logic toggle power consumption in the adder tree (16b simulation @ 400MHz, 0.9V): computing $A + B$ with an adder takes 174.38 nW, whereas $A \mid B$ takes 29.75 nW when $A + B = A \mid B$.

| Models   | Adder Tree Total | Adder Tree Redun. | Ratio |
|----------|------------------|-------------------|-------|
| ResNet18 | 0.11mJ           | 0.083mJ           | 76.3% |
| GPT-2    | 5.8mJ            | 3.1mJ             | 53.5% |
| ViT-B    | 1.1mJ            | 0.63mJ            | 57.6% |

Dynamic aligned accumulation incurs 66.8% power waste.


#### **Overall Architecture of POSIT CIM Macro**



- BRPU replaces regime codec with Shift and OR logic to save regime pre-processing energy.
- CPCS CIM Array exploits spare bits to perform dual-bit MAC to increase CIM utilization.
- CASU simplifies addition logic to bit-wise OR operations to reduce accumulation power.


| Binary | 00001 | 0001x | 001xx | •••• | 110xx | 1110x | 11110 |
|--------|-------|-------|-------|------|-------|-------|-------|
| Regime | -4    | -3    | -2    | •••• | 1     | 2     | 3     |

Negative R: the negated count of leading 0s (0001 encodes -3).

Positive R: the count of leading 1s minus 1 (1110 encodes 2).

$$A \times B = (S_A \times S_B) \times (2^K)^{(R_A + R_B)} \times 2^{(E_A + E_B)} \times (1.f_A \times 1.f_B)$$



- Step 1: Regime extraction with a leading-1/0 detector.
- Step 2: Regime processing with codec and addition.
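The regime rule and the product formula above can be sketched in a few lines of Python. This is a minimal illustration, not the macro's datapath: the helper names `decode_posit`, `posit_value`, and `multiply_fields` are hypothetical, $K$ is assumed to equal $2^{es}$ (the usual posit convention), and zero and NaR are left out.

```python
from fractions import Fraction


def decode_posit(bits: int, n: int, es: int):
    """Return (sign, regime, exponent, mantissa) of a non-special n-bit posit."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0 or bits == 1 << (n - 1):
        raise ValueError("zero and NaR are outside this sketch")
    sign = -1 if bits >> (n - 1) else 1
    if sign < 0:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    width = n - 1
    body = bits & ((1 << width) - 1)
    # Step 1: leading 1/0 detector, counting the run of identical leading bits.
    first = body >> (width - 1)
    run = 0
    while run < width and ((body >> (width - 1 - run)) & 1) == first:
        run += 1
    regime = run - 1 if first else -run
    # Bits left after the regime run and its terminating bit.
    rest_width = max(width - run - 1, 0)
    rest = body & ((1 << rest_width) - 1)
    exp_bits = min(es, rest_width)
    exponent = (rest >> (rest_width - exp_bits)) << (es - exp_bits)
    frac_width = rest_width - exp_bits
    frac = rest & ((1 << frac_width) - 1)
    mantissa = 1 + Fraction(frac, 1 << frac_width)  # the 1.f term
    return sign, regime, exponent, mantissa


def posit_value(fields, es: int) -> Fraction:
    sign, regime, exponent, mantissa = fields
    return sign * Fraction(2) ** (regime * 2**es + exponent) * mantissa


def multiply_fields(a, b, es: int) -> Fraction:
    """A x B assembled field by field, following the product formula."""
    (sa, ra, ea, fa), (sb, rb, eb, fb) = a, b
    useed = Fraction(2 ** (2**es))  # assumption: 2^K with K = 2^es
    return (sa * sb) * useed ** (ra + rb) * Fraction(2) ** (ea + eb) * (fa * fb)


a = decode_posit(0b01100000, 8, 1)  # regime 1 in Posit(8,1)
b = decode_posit(0b00110000, 8, 1)  # regime -1, exponent 1
print(posit_value(a, 1), posit_value(b, 1), multiply_fields(a, b, 1))
```

The regime sum `ra + rb` is the term that Step 2 has to process, which is where the codec-and-addition cost arises.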



■ BRPU replaces codec-addition with shift-or processing.

[Figure: Same-sign R<sub>A</sub> + R<sub>B</sub>. Decoder power (uW) versus shift code, comparing static decoding with small-|R| decoding (annotated average: 0.15uW). Simulation with TSMC 28nm technology at 400MHz.]

- BRPU dynamically decodes the small $|R_B|$ to shift the large $|R_A|$.
- BRPU minimizes the shift code to save 40% of shift energy.
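The same-sign case can be illustrated with unary regime runs. The sketch below is one reading of the slide rather than the BRPU circuit: it covers only non-negative regimes, and the helper names `regime_run`, `add_same_sign`, and `run_to_regime` are hypothetical. A regime R is held as a run of R + 1 ones, the smaller regime is decoded into a shift code, and the larger run is shifted and OR-ed with a fill mask.

```python
def regime_run(r: int) -> int:
    """Unary run for a non-negative regime: r + 1 ones (r = 3 -> 0b1111)."""
    if r < 0:
        raise ValueError("this sketch covers non-negative regimes only")
    return (1 << (r + 1)) - 1


def add_same_sign(r_a: int, r_b: int) -> int:
    """Run encoding of r_a + r_b, formed by one shift and one OR."""
    large, small = max(r_a, r_b), min(r_a, r_b)
    fill = (1 << small) - 1  # decoded shift code: `small` ones
    # Shifting the larger run by the smaller regime keeps the shift code short.
    return (regime_run(large) << small) | fill


def run_to_regime(run: int) -> int:
    """Inverse of regime_run for an all-ones run."""
    return run.bit_length() - 1


print(bin(add_same_sign(3, 2)), run_to_regime(add_same_sign(3, 2)))
```

Because the shift amount is always the smaller of the two magnitudes, the shifter never needs to cover the full regime range, which is consistent with the shift-energy saving reported above.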



- Different sign addition: logic shift to decrease 1's/0's counts.
- If shift code ≥ R's effective bit-width, it introduces shift error.



- BRPU dynamically decodes the small $|R_B|$ to shift the large $|R_A|$.
- BRPU avoids shift overflow, reducing shift logic by 50%.
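The different-sign case and its overflow condition can be sketched with regime runs held as strings. This is an illustrative model under stated assumptions, not the BRPU logic: a regime R >= 0 is a run of R + 1 ones, a regime R < 0 is a run of |R| zeros, and the function names are hypothetical. Shifting always in one fixed direction fails once the shift code reaches the run length, while choosing the longer run as the shifted operand does not.

```python
def encode(r: int) -> str:
    """Regime run: r + 1 ones for r >= 0, |r| zeros for r < 0."""
    return "1" * (r + 1) if r >= 0 else "0" * (-r)


def decode(run: str) -> int:
    if not run:
        raise ValueError("shift error: the run was shifted out completely")
    return len(run) - 1 if run[0] == "1" else -len(run)


def shift_run(run: str, code: int) -> str:
    """Logical shift that shortens the run by `code` positions."""
    return run[: max(len(run) - code, 0)]


def add_fixed_direction(r_a: int, r_b: int) -> str:
    """Always shift R_A's run by |R_B|; breaks when |R_B| >= run length."""
    return shift_run(encode(r_a), abs(r_b))


def add_bidirectional(r_a: int, r_b: int) -> str:
    """Shift the run with the larger magnitude (ties go to the ones run)."""
    if abs(r_a) > abs(r_b) or (abs(r_a) == abs(r_b) and r_a >= 0):
        return shift_run(encode(r_a), abs(r_b))
    return shift_run(encode(r_b), abs(r_a))


print(decode(add_bidirectional(3, -2)))   # regime of the sum 3 + (-2)
print(repr(add_fixed_direction(2, -3)))   # empty run: shift error
```

In this model the shift code of the bi-directional variant is always smaller than the run it is applied to, so the overflow case never occurs.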


#### Mantissa Distribution of Posit Format Weight for ResNet18 Training

| Format     | 2b    | 3b    | 4b    | Others    |
|------------|-------|-------|-------|-----------|
| Posit(8,1) | 15.8% | 46.1% | 31.6% | remainder |
| Posit(8,2) | 10.6% | 51.6% | 34.2% | remainder |



■ Dynamic mantissa bit-width introduces 48.9% cell waste.



CPCS uses spare bits to achieve dual-bit MAC in each cycle.








The pre-computer runs only once, before the weights are stored.


#### **Bit-wise OR-based Accumulation**

[Figure: Overlap ratio (%) of A and B versus $|E_A - E_B|$ for POSIT(8,2). After shift alignment of the CIM array outputs $A_0 \times W_0$ and $A_1 \times W_1$, operands with no overlap bits satisfy $A + B = A \mid B$.]

■ If A and B have no overlap bits, A + B is equal to A | B.
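The identity behind this statement is $A + B = (A \mid B) + (A \,\&\, B)$, so the carry term vanishes exactly when the operands share no set bits. A minimal Python sketch follows; the function name `accumulate` and the returned path label are illustrative only.

```python
def accumulate(a: int, b: int):
    """Add two non-negative integers, using OR when no bit positions overlap."""
    if a & b == 0:
        return a | b, "or"   # no carries can occur, so OR equals the sum
    return a + b, "adder"    # overlapping bits generate carries


# Aligned operands that occupy disjoint bit ranges take the OR path.
print(accumulate(0b10100000, 0b00000101))
# A single shared bit is enough to require the adder.
print(accumulate(0b10100000, 0b00100101))
```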



Even if A₀ and A₁ have only 1 overlap bit, A₀W₀+A₁W₁ has to use an adder.



All cycles need adders for synchronous bit-serial computing.



■ CASU cyclically shifts  $A_0$  for asynchronous computing with  $A_1$ .



■ CASU eliminates overlap bits in former cycles of  $A_0W_0+A_1W_1$ .



■ CASU saves 56.9% of accumulation energy for adder tree.


#### **Chip Photograph and Summary**





| Specification | Value                |
|---------------|----------------------|
| Technology    | 28nm CMOS            |
| Die Area      | 1.41 mm<sup>2</sup>  |
| CIM Size      | 12 KB                |
| Buffer Size   | 28 KB                |
| Voltage       | 0.55V-1.0V           |
| Frequency     | 78MHz-419MHz         |

| Data Precision                             | Posit(8,1)                | Posit(16,2)               |
|--------------------------------------------|---------------------------|---------------------------|
| Peak Performance <sup>1)</sup>             | 3.86TOPS                  | 1.91TOPS                  |
| Area Efficiency                            | 2.74TOPS/mm<sup>2</sup>   | 1.35TOPS/mm<sup>2</sup>   |
| CIM Macro Energy Efficiency <sup>2)</sup>  | 16.34-83.23 TOPS/W        | 7.47-38.37 TOPS/W         |
| System Energy Efficiency <sup>2)</sup>     | 10.90-55.60 TOPS/W        | 5.35-27.61 TOPS/W         |
| ResNet18 <sup>3)</sup> @Imagenet1k         | 69.64% (Top1 Acc↑)        | 69.71% (Top1 Acc↑)        |
| GPT-2 <sup>4)</sup> @Wikitext-2            | 21.31 (Perplexity↓)       | 21.45 (Perplexity↓)       |
| ViT-B <sup>5)</sup> @Imagenet1k            | 80.17% (Top1 Acc↑)        | 80.05% (Top1 Acc↑)        |

One operation (OP) represents one multiplication or addition.

- 1) Highest performance (lowest efficiency) point, 1.0V, 419MHz
- 2) Highest efficiency point, 0.65V, 78MHz, 50% input sparsity
- 3) The baseline is 69.76%
- 4) The baseline is 21.30
- 5) The baseline is 80.31%
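As a consistency check, the area-efficiency row of the summary table follows from dividing peak performance by die area. The short sketch below reproduces that arithmetic; the variable and function names are illustrative, and the inputs are taken directly from the table.

```python
DIE_AREA_MM2 = 1.41
PEAK_TOPS = {"Posit(8,1)": 3.86, "Posit(16,2)": 1.91}


def area_efficiency(peak_tops: float, area_mm2: float) -> float:
    """Area efficiency in TOPS/mm^2."""
    return peak_tops / area_mm2


area_eff = {
    fmt: round(area_efficiency(tops, DIE_AREA_MM2), 2)
    for fmt, tops in PEAK_TOPS.items()
}
print(area_eff)  # matches the 2.74 and 1.35 TOPS/mm^2 entries
```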

#### **Area and Power Breakdown**



- BRPU and CASU take limited area (22.5%) and power (21.8%)
- CIM Array takes most area (39.9%) and power (41.2%)

#### Training and Inference Performance

#### Evaluation on Different Models<sup>1)</sup>

| Model                                 | ResNet18          | GPT-2              | ViT-B             |
|---------------------------------------|-------------------|--------------------|-------------------|
| Dataset                               | Imagenet-1k       | Wikitext-2         | Imagenet-1k       |
| Task                                  | Training          | Inference          | Inference         |
| Data Precision                        | Posit(16,2)       | Posit(8,1)         | Posit(8,1)        |
| Accuracy                              | 69.71% (Top1 Acc) | 21.31 (Perplexity) | 80.17% (Top1 Acc) |
| Accuracy Loss <sup>2)</sup>           | -0.04%            | -0.05%             | -0.14%            |
| Performance (TOPS) <sup>3)</sup>      | 1.73              | 2.95               | 2.91              |
| CIM Macro Energy Efficiency (TOPS/W)  | 16.27             | 34.30              | 33.81             |
| System Energy Efficiency (TOPS/W)     | 10.45             |                    | 21.48             |
| Energy Saving <sup>5)</sup>           | 8.81x             |                    | 7.12x             |



- POSIT-CIM incurs only 0.14% accuracy loss compared with FP32.
- It achieves 10.45TFLOPS/W of average energy efficiency.

<sup>1)</sup> Measured at 1.0V, 419 MHz for high-performance evaluation.

<sup>2)</sup> Compared with models trained in FP32.

<sup>3)</sup> Includes all on-chip components. Off-chip memory is not included.

#### **Performance Comparison**

#### Comparison with SOTA FP CIM Macros

|                                | VLSI'21[1]              | ISSCC'22[2]             | ISSCC'23[3]             | ISSCC'23[4]                                              | ISSCC'21[5]                                                | ISSCC'21[6]                     | This Work                                                 |
|--------------------------------|-------------------------|-------------------------|-------------------------|----------------------------------------------------------|------------------------------------------------------------|---------------------------------|-----------------------------------------------------------|
| Dynamic Format                 | NO                      | NO                      | NO                      | NO                                                       | NO                                                         | YES                             | YES                                                       |
| Technology (nm)                | 28                      | 28                      | 22                      | 28                                                       | 28                                                         | 28                              | 28                                                        |
| Die Area (mm²)                 | 5.83                    | 6.69                    | 18                      | 0.146                                                    | 4.54                                                       | 3.8                             | 1.41                                                      |
| Supply Voltage (V)             | 0.76-1.1                | 0.6-1.0                 | 0.6-0.8                 | 0.6-0.9                                                  | 0.397-0.90                                                 | 0.6-0.9                         | 0.55-1.0                                                  |
| Frequency (MHz)                | 250                     | 50-220                  | NA                      | NA                                                       | 10-400                                                     | 104-288                         | 78-419                                                    |
| Precision                      | BF16                    | FP32/BF16<br>INT16/INT8 | BF16                    | BF16<br>INT8                                             | FP16/BF16<br>INT8/4                                        | 32b/16b/8b<br>CUSTOM POSIT      | POSIT16<br>POSIT8                                         |
| Power (mW)                     | 1.2-156.1 <sup>1)</sup> | 12.5-69.4               | NA                      | NA                                                       | 0.87-74.9                                                  | 50-230                          | 5.5-237                                                   |
| Performance<br>(TOPS)          | 0.12-0.66 <sup>1)</sup> | 0.14@FP32<br>1.35@INT8  | 1.24-1.28               | NA                                                       | 1.64-9.63 <sup>3)</sup> @INT4                              | 0.0163@POSIT16<br>0.0337@POSIT8 | 1.91@POSIT16 <sup>2)</sup><br>3.86@POSIT8 <sup>2)</sup>   |
| Energy Efficiency<br>(TOPS/W)  | 1.43-13.7 <sup>1)</sup> | 3.7@FP32<br>36.5@INT8   | 16.2-70.2 <sup>1)</sup> | 14-31.6@BF16 <sup>2)</sup><br>19.5-44@INT8 <sup>2)</sup> | 3.2-16.9 <sup>3)</sup> @FP16<br>51-300 <sup>3)</sup> @INT4 | 0.121@POSIT16<br>0.248@POSIT8   | 38.37@POSIT16 <sup>2)</sup><br>83.23@POSIT8 <sup>2)</sup> |
| Area Efficiency<br>(TOPS/mm²)  | 0.021-1.1 <sup>1)</sup> | 0.02@FP32<br>0.20@INT8  | 0.069-0.071             | NA                                                       | 0.36-2.12 <sup>3)</sup> @INT4                              | 0.0043@POSIT16<br>0.0089@POSIT8 | 1.35@POSIT16 <sup>2)</sup><br>2.74@POSIT8 <sup>2)</sup>   |

<sup>1)</sup> Evaluated with 90% input sparsity.

<sup>2)</sup> Evaluated with 50% input sparsity.

<sup>3)</sup> From dense models to average of test sparse NN models.


#### Conclusion

- An Energy Efficient POSIT-Based CIM Macro
  - Bi-directional Regime Processing Codec
    - ✓ Save Pre-processing Energy by Replacing Codec with Shift-OR
  - Critical-bit Pre-compute-and-store CIM Array
    - ✓ Improve CIM Utilization by Using Spare Bit for Dual-bit MAC
  - Cyclically-alternating Scheduling Adder Tree
    - ✓ Reduce Accumulation Power by Simplifying Addition to OR

A POSIT-Based CIM Macro with Bi-directional Regime Codec, Critical-bit Pre-computing-Storing and Cyclically-alternating Scheduling Achieving 83.23TFLOPS/W Energy Efficiency