# **MCUNet: Tiny Deep Learning on IoT Devices**



Ji Lin<sup>1</sup>, Wei-Ming Chen<sup>1,2</sup>, Yujun Lin<sup>1</sup>, John Cohn<sup>3</sup>, Chuang Gan<sup>3</sup>, Song Han<sup>1</sup>

<sup>1</sup>MIT <sup>2</sup>National Taiwan University <sup>3</sup>MIT-IBM Watson AI Lab



NeurIPS 2020 (spotlight)



### **Background: The Era of AIoT on Microcontrollers (MCUs)**

• Low-cost, low-power

• Rapid growth (chart axis: #Units, in billions)

• Wide applications: Smart Retail, Personalized Healthcare, Precision Agriculture, Smart Home, ...



|                     | **Cloud AI** | **Mobile AI** |
|---------------------|--------------|---------------|
| Memory (Activation) | 16GB         | 4GB           |
| Storage (Weights)   | ~TB/PB       | 256GB         |































### **Existing efficient networks only reduce model size, NOT activation size!**

Figure: peak activation (MB) at ~70% ImageNet Top-1 (chart annotation: 1.8x)










































(a) Search NN model on an existing library, e.g., *ProxylessNAS, MnasNet*

(b) Tune deep learning library given a NN model, e.g., *TVM*

(c) *MCUNet*: system-algorithm co-design





### **TinyNAS: Two-Stage NAS for Tiny Memory Constraints**

• Search space design is crucial for NAS performance

• There is no prior expertise on MCU model design

Figure: **Full Network Space** -> **Optimized Search Space**











Revisit ProxylessNAS search space:

$$S = \text{kernel size} \times \text{expansion ratio} \times \text{depth}$$











Extended search space to cover a wide range of hardware capacity:

$$S' = \text{kernel size} \times \text{expansion ratio} \times \text{depth} \times \text{input resolution } R \times \text{width multiplier } W$$

Different *R* and *W* for different hardware capacity (i.e., different optimized sub-space), e.g., *R*=224, *W*=1.0

MCUs: F412/F743/H746/... with 256kB/320kB/512kB/...

\* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20
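As a rough illustration of the extended space, each (*R*, *W*) pair can be treated as one candidate sub-space, inside which kernel size, expansion ratio, and depth are searched. The sketch below only enumerates those candidates. The grids for *R* and *W* and the per-block choices are assumptions made for illustration; the slide itself only names *R*=224, *W*=1.0.

```python
from itertools import product

# Assumed grids, chosen for illustration only.
RESOLUTIONS = list(range(48, 225, 16))                            # 48, 64, ..., 224
WIDTH_MULTIPLIERS = [round(0.2 + 0.1 * i, 1) for i in range(9)]   # 0.2, 0.3, ..., 1.0

# Assumed per-block choices of the base space S.
KERNEL_SIZES = (3, 5, 7)
EXPANSION_RATIOS = (3, 4, 6)
DEPTHS = (2, 3, 4)


def search_space_configs():
    """Return one sub-space description per (R, W) pair of the extended space S'."""
    return [
        {
            "resolution": r,
            "width_mult": w,
            "kernel_sizes": KERNEL_SIZES,
            "expansion_ratios": EXPANSION_RATIOS,
            "depths": DEPTHS,
        }
        for r, w in product(RESOLUTIONS, WIDTH_MULTIPLIERS)
    ]


configs = search_space_configs()
print(len(configs))  # number of candidate (R, W) sub-spaces under the assumed grids
```

Smaller MCUs would be expected to end up with sub-spaces using lower *R* and *W*, larger ones with higher values.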





# **TinyNAS: (1) Automated search space optimization**

Analyzing the **FLOPs distribution** of satisfying models in each search space: Larger FLOPs -> Larger model capacity -> More likely to give higher accuracy

Figure: FLOPs distribution of satisfying models per search space (constraint shown: 320kB). mFLOPs per search space: 32.5, 32.4, 39.3, 46.9, 38.3, 46.9, 52.0, 41.3, 31.4, 38.4
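The selection heuristic can be sketched as follows: sample random models from each candidate (*R*, *W*) sub-space, keep those that satisfy the memory constraint, and prefer the sub-space whose satisfying models have the largest FLOPs. This is a toy sketch, not the authors' implementation. The stage widths, the cost formulas, and the use of mean FLOPs as the summary statistic are all assumptions.

```python
import random


def sample_model(resolution, width_mult, rng):
    """Sample a toy network as a list of (channels, feature_size, kernel, expand) blocks."""
    base_channels = [16, 24, 40, 80]      # assumed stage widths
    blocks, size = [], resolution // 2    # assume a stride-2 stem
    for c in base_channels:
        size = max(1, size // 2)          # each stage halves the feature map
        ch = max(8, int(c * width_mult))
        for _ in range(rng.choice([1, 2, 3])):   # depth choice
            blocks.append((ch, size, rng.choice([3, 5, 7]), rng.choice([3, 4, 6])))
    return blocks


def flops(blocks):
    """Rough MAC count of inverted-bottleneck blocks (two 1x1 convs plus one depth-wise conv)."""
    total = 0
    for ch, size, k, e in blocks:
        hidden = ch * e
        total += size * size * (2 * ch * hidden + k * k * hidden)
    return total


def peak_activation_bytes(blocks):
    """Toy int8 peak activation: block input plus the expanded hidden tensor."""
    return max(size * size * (ch + ch * e) for ch, size, k, e in blocks)


def mean_flops_under_budget(resolution, width_mult, sram_bytes, n=200, seed=0):
    """Mean FLOPs of sampled models that fit in SRAM (0.0 if none fit)."""
    rng = random.Random(seed)
    models = [sample_model(resolution, width_mult, rng) for _ in range(n)]
    ok = [flops(m) for m in models if peak_activation_bytes(m) <= sram_bytes]
    return sum(ok) / len(ok) if ok else 0.0


def pick_search_space(candidates, sram_bytes):
    """Choose the (R, W) pair whose satisfying models have the highest mean FLOPs."""
    return max(candidates, key=lambda rw: mean_flops_under_budget(*rw, sram_bytes))


candidates = [(48, 0.2), (96, 0.5), (224, 1.0)]
best = pick_search_space(candidates, sram_bytes=64 * 1024)
```

Under a tight budget the largest sub-space has no satisfying models at all, so a mid-sized one is preferred; under a looser budget the choice shifts toward larger *R* and *W*.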

One-shot NAS through weight sharing

• Small sub-networks are nested in large sub-networks.

• Directly evaluate the accuracy of sub-nets

\* Cai et al., Once-for-All: Train One Network and Specialize it for Efficient Deployment, ICLR'20







Elastic **Kernel Size**

• Start with **full** kernel size

• Smaller kernel takes centered weights
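A minimal sketch of "smaller kernel takes centered weights", using plain nested lists for a single 2-D kernel. The function name `center_crop_kernel` is hypothetical, and the sketch covers only the cropping step stated on the slide.

```python
def center_crop_kernel(kernel, k):
    """Return the centered k x k window of a square 2-D kernel (list of rows)."""
    full = len(kernel)
    if k > full or (full - k) % 2 != 0:
        raise ValueError("expected sizes such as 7 -> 5 -> 3")
    start = (full - k) // 2
    return [row[start:start + k] for row in kernel[start:start + k]]


# A 7x7 kernel whose entries encode their own position (row * 7 + col).
full_kernel = [[r * 7 + c for c in range(7)] for r in range(7)]
k5 = center_crop_kernel(full_kernel, 5)
k3 = center_crop_kernel(full_kernel, 3)
```

Cropping 7 -> 5 -> 3 gives the same 3x3 window as cropping 7 -> 3 directly, which is what lets the smaller kernels share weights with the full one.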















Elastic **Width**

• Shrink the width

• Keep the most important channels when shrinking, via channel sorting
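Channel sorting can be sketched as ranking output channels by an importance score and keeping the top ones. The slide does not say which score is used; the L1 norm of each channel's weights below is an assumption, and both function names are hypothetical.

```python
def sort_channels_by_importance(weights):
    """Return channel indices, most important first.

    weights: one flat list of weights per output channel.
    Importance is the L1 norm of the channel's weights (assumed metric).
    """
    importance = [sum(abs(w) for w in channel) for channel in weights]
    return sorted(range(len(weights)), key=lambda i: importance[i], reverse=True)


def shrink_width(weights, keep):
    """Keep only the `keep` most important output channels."""
    order = sort_channels_by_importance(weights)
    return [weights[i] for i in order[:keep]]


layer = [[0.1, -0.2], [1.5, 0.5], [-0.9, 0.3], [0.0, 0.05]]
order = sort_channels_by_importance(layer)
narrow = shrink_width(layer, keep=2)
```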





# **TinyNAS Better Utilizes the Memory**

TinyNAS designs networks with more uniform peak memory for each block, allowing us to fit a larger model in the same amount of memory.

Figure: **Peak Memory for First Two Stages** (**TinyNAS**)
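The argument can be made concrete with a small calculation: the SRAM requirement is set by the block with the highest peak memory, so blocks far below that peak leave memory unused. The per-block numbers below are made up for illustration and are not measurements from the talk.

```python
def network_peak(block_peaks_kb):
    """Running block by block, the SRAM must fit the hungriest block."""
    return max(block_peaks_kb)


def average_utilization(block_peaks_kb):
    """Average fraction of the network peak used per block (1.0 means perfectly uniform)."""
    return sum(block_peaks_kb) / (len(block_peaks_kb) * network_peak(block_peaks_kb))


# Hypothetical per-block peak memory in kB (made-up numbers).
imbalanced = [300, 120, 60, 40]
uniform = [300, 280, 290, 300]

print(network_peak(imbalanced), round(average_utilization(imbalanced), 3))
print(network_peak(uniform), round(average_utilization(uniform), 3))
```

Both hypothetical networks need the same SRAM, but the uniform one uses far more of it per block, which is the sense in which a larger model fits in the same memory.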





































1. Reducing overhead with separated compilation & runtime



(b) TinyEngine: Model-adaptive code generation.

















2. In-place depth-wise convolution
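A minimal sketch of the idea, assuming stride 1 and zero "same" padding: depth-wise convolution never mixes channels, so each channel's output can overwrite its own input once it has been computed into a one-channel temporary buffer. For N channels the activation memory is then roughly N + 1 channels instead of 2N. The function below is an illustration in plain Python, not TinyEngine code.

```python
def depthwise_conv_inplace(activation, kernels):
    """Depth-wise 2-D convolution (stride 1, zero 'same' padding), overwriting `activation`.

    activation: list of C channels, each an H x W list of lists (modified in place)
    kernels:    list of C kernels, each k x k with k odd
    """
    height, width = len(activation[0]), len(activation[0][0])
    temp = [[0.0] * width for _ in range(height)]   # the only extra buffer: one channel
    for channel, kernel in zip(activation, kernels):
        pad = len(kernel) // 2
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for ky, krow in enumerate(kernel):
                    for kx, kval in enumerate(krow):
                        iy, ix = y + ky - pad, x + kx - pad
                        if 0 <= iy < height and 0 <= ix < width:
                            acc += kval * channel[iy][ix]
                temp[y][x] = acc
        # No other channel reads this input, so its storage can now hold the output.
        for y in range(height):
            channel[y][:] = temp[y]
    return activation


identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
act = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
out = depthwise_conv_inplace(act, [identity, box])
```

The temporary buffer is needed because each output pixel reads neighbouring input pixels, so a channel cannot be overwritten pixel by pixel while it is still being read.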
















### Analyzing Million MAC/s improved by each technique

(1) Code generator-based compilation -> Eliminate overheads of runtime interpretation





(2) Model-adaptive memory scheduling -> Increase data reuse for each layer

(a) Model-level memory scheduling

$$M = \max\left(\text{kernel size}_{L_i}^2 \cdot \text{in channels}_{L_i};\ \forall L_i \in \boldsymbol{L}\right)$$

(b) Tile size configuration for Im2col

$$\text{tiling size of feature map width}_{L_j} = \left\lfloor M / \left( \text{kernel size}_{L_j}^2 \cdot \text{in channels}_{L_j} \right) \right\rfloor$$
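The two scheduling formulas can be worked through numerically: the buffer size M is the largest value of kernel size squared times input channels over all layers, and each layer's tiling width is M divided by that layer's own value, rounded down. The layer list and function names below are hypothetical examples.

```python
def im2col_buffer_size(layers):
    """M = max over layers of kernel_size^2 * in_channels."""
    return max(l["kernel_size"] ** 2 * l["in_channels"] for l in layers)


def tiling_widths(layers):
    """Per-layer tiling size of the feature map width: floor(M / (kernel_size^2 * in_channels))."""
    m = im2col_buffer_size(layers)
    return [m // (l["kernel_size"] ** 2 * l["in_channels"]) for l in layers]


# Hypothetical layers, for illustration only.
layers = [
    {"kernel_size": 3, "in_channels": 16},   # 3^2 * 16 = 144
    {"kernel_size": 1, "in_channels": 96},   # 1^2 * 96 = 96
    {"kernel_size": 5, "in_channels": 24},   # 5^2 * 24 = 600
]

M = im2col_buffer_size(layers)
tiles = tiling_widths(layers)
```

The layer that defines M gets a tile width of 1, while lighter layers can process several columns per pass in the same buffer, which is where the extra data reuse comes from.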





(3) Computation kernel specialization:

- Operation fusion, e.g., fuse Pad+Conv+ReLU+BN
- Loop unrolling: eliminate the branch instruction overheads of loops
- Loop tiling for each layer
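To illustrate how code generation and loop unrolling fit together, the sketch below emits C source for one k x k multiply-accumulate with the kernel loops fully unrolled, so the generated function has no loop branches. The generator, its function name, and the emitted C signature are hypothetical and are not taken from TinyEngine.

```python
def generate_unrolled_mac(kernel_size, name="mac_k"):
    """Emit C source for a k x k multiply-accumulate with both kernel loops unrolled."""
    lines = [
        f"static inline int32_t {name}{kernel_size}"
        "(const int8_t *in, const int8_t *w, int stride) {",
        "    int32_t acc = 0;",
    ]
    for ky in range(kernel_size):
        for kx in range(kernel_size):
            # One straight-line statement per kernel tap; indices are compile-time constants.
            lines.append(
                f"    acc += in[{ky} * stride + {kx}] * w[{ky * kernel_size + kx}];"
            )
    lines += ["    return acc;", "}"]
    return "\n".join(lines)


source = generate_unrolled_mac(3)
print(source)
```

Because the kernel size is known when the code is generated, a separate specialized function can be emitted for each kernel size that a given model actually uses.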



















• Consistent improvement on different networks





## **Experimental Results**

We focus on large-scale datasets to reflect real-life use cases.

### **Datasets:**

- (1) ImageNet-1000
- (2) Wake Words
  - Visual: Visual Wake Words
  - Audio: Google Speech Commands







Figure: Visual Wake Words examples, (a) 'Person' and (b) 'Not-person'; Speech Commands example keyword: "yes"











## **System-Algorithm Co-design Gives the Best Results**

• ImageNet classification on STM32F746 MCU (**320kB SRAM**, **1MB Flash**)

**Baseline** (MbV2\*+CMSIS), **System-only** (MbV2\*+TinyEngine), **Model-only** (TinyNAS+CMSIS), **Co-design** (TinyNAS+TinyEngine)

ImageNet Top1: 35%

\* scaled down version: width multiplier 0.3, input resolution 80







# Handling Diverse Hardware

• Specializing models (int4) for different MCUs (<u>SRAM</u>/Flash)

• The first to achieve >70% ImageNet accuracy on **commercial MCUs**

Figure: **ImageNet Top-1 Accuracy (%)**







## **Reduce Both Model Size and Activation Size**

Figure: ResNet-18, MobileNetV2-0.75, and MCUNet compared at ~70% ImageNet Top-1, including peak activation (MB). Chart annotations: 1.8x, 24.6x, 13.8x





# Visual Wake Words (VWW)














# Audio Wake Words (Speech Commands)











• Detecting whether a person is present in the frame







### 87% accuracy, fps: 7.3





## **MCUNet: Tiny Deep Learning on IoT Devices**




• Our study suggests that the era of tiny machine learning on IoT devices has arrived

Project Page: http://tinyml.mit.edu







