2020 MCUNet
Q&A
- What is the search space?
- What is the mobile search space?
- ? [c42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, 2019.
- What is a model? What is the system part and the model part in the system-model co-design?
What is one-shot architecture search?
- ? [c4] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018
- ? [c17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. arXiv, 2019.
Neural Architecture Search (NAS).
- What is the position of NAS in the entire ML system?
- How many categories of search methods are there in general?
Interpreter-based inference libraries (TF-Lite Micro, CMSIS-NN)
- How is this executed in practice?
Evolution search?
Memory Scheduling?
- Who is doing the scheduling?
What is operation fusion?
Int8 linear quantization?
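A minimal numpy sketch of generic affine (linear) int8 quantization, q = round(x / scale) + zero_point, as used in TF-Lite-style int8 inference; this is the textbook formulation, not necessarily MCUNet's exact per-tensor/per-channel scheme.

```python
import numpy as np

def quantize_int8(x, qmin=-128, qmax=127):
    """Affine (linear) int8 quantization: real value r ~= scale * (q - zero_point)."""
    scale = (x.max() - x.min()) / (qmax - qmin)
    scale = max(scale, 1e-8)                      # guard against constant inputs
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(8).astype(np.float32)
q, s, z = quantize_int8(x)
print(np.abs(dequantize_int8(q, s, z) - x).max())  # quantization error is roughly <= scale
```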
Hard-swish activation (in MobileNetV3[c23])
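For reference, hard-swish as defined in MobileNetV3 [c23] is h-swish(x) = x * ReLU6(x + 3) / 6, a piecewise-linear approximation of swish that avoids computing a sigmoid; a quick numpy sketch:

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hard_swish(x):
    # MobileNetV3: h-swish(x) = x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

print(hard_swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))  # ~[0, -0.33, 0, 0.67, 4]
```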
References:
- Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. MCUNet: Tiny Deep Learning on IoT Devices. In NeurIPS, 2020.
- [c6] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In ICLR, 2019.
- [c4] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018
- [c17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. arXiv, 2019.
- [c42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, 2019.
Motivation
ARM Cortex-M7 MCU: 320KB SRAM, 1MB flash storage.
- ResNet-50 exceeds the storage limit by 100x
- MobileNetV2 exceeds the peak memory limit by 22x
- int8 quantized version of MobileNetV2 still exceeds the memory limit by 5.3x
MCUs are bare-metal devices that do not have an operating system.
- Need to jointly design the deep learning model and inference library;
Limited prior literature:
- Existing efficient-network work targets mobile GPUs or smartphones, which have abundant memory and storage.
- They only optimize to reduce FLOPs or latency
- Resulting models cannot fit microcontrollers.
- Machine learning on microcontrollers
- either studies tiny-scale datasets (e.g., CIFAR or sub-CIFAR level), which are far from real-life use cases,
- or uses weak neural networks that cannot achieve decent performance.
- Examples: [c15] [c30] [c40] [c28]
- Deep learning inference on microcontrollers.
- Most frameworks rely on an interpreter to parse the network graph at runtime, which consumes extra SRAM and Flash (up to 65% of peak memory) and increases latency by 22%.
- Optimization is performed at the layer level, which fails to utilize the overall network architecture information to further reduce memory usage.
- Examples: TensorFlow Lite Micro[c3], CMSIS-NN[c27], CMix-NN[c8], MicroTVM[c9]
Efficient Neural Network Design
Two ways to improve the performance of a deep learning system:
- One is to compress the off-the-shelf networks
- Pruning
- Quantization
- Tensor decomposition
- Another is to design a new network suitable for mobile scenarios
- Neural Architecture Search (NAS) dominates efficient neural network design
The performance of NAS highly depends on the quality of the search space [c38].
- Traditionally, search space design follows manual heuristics.
- E.g., the mobile-setting search space [c6][c45]:
- Originates from MobileNetV2 [c41];
- Uses a 224 input resolution and similar base channel number configurations;
- Searches over kernel sizes, block depths, and expansion ratios;
For MCUs: there are no standard model designs or search space designs for memory-limited microcontrollers.
- It is possible to manually tweak the search space for each MCU, but this is labor-intensive;
- Need to automatically optimize the search space for tiny and diverse deployment scenarios.
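To make "optimize the search space" concrete, here is a hypothetical sketch of how a family of candidate search spaces could be enumerated by scaling the width multiplier and input resolution on top of a mobile-style (MnasNet/ProxylessNAS-like) per-block space; all knob values below are illustrative assumptions, not the grid used in the paper.

```python
import itertools

# Illustrative (assumed) knob values; each (w, r) pair defines one candidate search space.
WIDTH_MULTIPLIERS = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
RESOLUTIONS = [48, 64, 80, 96, 112, 128, 144, 160, 176]

# Mobile-style per-block choices searched *within* each space (kernel size, expansion ratio).
PER_BLOCK_CHOICES = {"kernel_size": [3, 5, 7], "expansion_ratio": [3, 4, 6]}

candidate_spaces = [
    {"width_mult": w, "resolution": r, "block_choices": PER_BLOCK_CHOICES}
    for w, r in itertools.product(WIDTH_MULTIPLIERS, RESOLUTIONS)
]
print(len(candidate_spaces), "candidate search spaces to compare via their FLOPs CDFs")
```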
Overview
MCUNet, a system-model co-design framework that enables ImageNet-scale deep learning on microcontrollers.
Efficient neural architecture – TinyNAS, with
- automated search space optimization
- resource-constrained model specialization.
Lightweight inference engine – TinyEngine
Enabling ImageNet-scale inference on microcontrollers.
SRAM (read/write) constrains the activation size; Flash (read-only) constrains the model size;
- Jointly optimize the deep learning model design (TinyNAS) and the inference library (TinyEngine) to reduce the memory usage;
- TinyNAS: two-stage neural architecture search (NAS) method that can handle the tiny and diverse memory constraints on various microcontrollers;
- Performance depends on the search space [c38], and TinyNAS explores the design heuristics at the tiny scale;
- First optimize the search space to fit the resource constraints;
- Generate different search spaces by scaling the input resolution and the model width
- Collect the computation FLOPs distribution of satisfying networks within the search space to evaluate its priority;
- Insight: a search space that can accommodate higher FLOPs under the memory constraint can produce better models
- Verified by experiments.
- Then perform the architecture search in the optimized space;
- TinyEngine:
- code generator-based compilation method to eliminate memory overhead.
- reduces memory overhead by 2.7x and improves inference speed by 22%
- model adaptive memory scheduling:
- not layer-wise optimization
- Optimize memory scheduling according to the overall network topology to get a better strategy.
- Performs specialized computation kernel optimizations (e.g., loop tiling, loop unrolling, operation fusion) for different layers;
Pushes the limit of deep network performance on microcontrollers.
- TinyEngine
- reduces peak memory usage by 2.7x (a ratio, not a subtraction: peak memory drops to roughly 1/2.7 of the baseline);
- accelerates inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN
- System-algorithm codesign: TinyNAS + TinyEngine
- ImageNet top-1 accuracy of 70.2% on a microcontroller.
- Visual & audio wake words: 2.4-3.4x faster; 2.2-2.6x smaller peak SRAM;
- Interactive applications: 10 FPS with 91% top-1 accuracy on the Speech Commands dataset.
TinyNAS
Two stage neural architecture search approach:
- Optimize the search space to fit the resource constraints
- Analyze the computation distribution
- Scale the input resolution and the width multiplier of the mobile search space.
- Select the best search space by analyzing the FLOPs CDF of the different search spaces (see the sketch below).
- Insight: the design space that is more likely to produce high FLOPs models under the memory constraint gives higher model capacity, thus more likely to achieve high accuracy.
- Specialize the network architecture in the optimized search space
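A rough sketch of the selection step above: sample many sub-networks from a candidate space, keep those that satisfy the SRAM budget, and compare how much computation the satisfying models get (the mean FLOPs of satisfying samples is used here as a cheap proxy for comparing the full FLOPs CDFs). The estimators below are crude stand-ins invented for illustration, not the paper's analytic memory/FLOPs models.

```python
import random

# Toy stand-ins for the paper's analytic estimators (hypothetical, for illustration only).
def sample_subnet(space, n_blocks=12):
    """Randomly pick per-block kernel size / expansion ratio within a candidate space."""
    c = space["block_choices"]
    return {"width_mult": space["width_mult"],
            "resolution": space["resolution"],
            "kernels": [random.choice(c["kernel_size"]) for _ in range(n_blocks)],
            "expands": [random.choice(c["expansion_ratio"]) for _ in range(n_blocks)]}

def estimate_flops(net):
    # Crude proxy: cost grows with resolution, width, kernel area, and expansion ratio.
    return net["resolution"] ** 2 * net["width_mult"] * sum(
        e * k * k for e, k in zip(net["expands"], net["kernels"]))

def estimate_peak_sram(net):
    # Crude proxy: the largest expanded activation map dominates peak SRAM.
    return net["resolution"] ** 2 * net["width_mult"] * max(net["expands"]) * 8

def evaluate_search_space(space, sram_budget, n_samples=1000):
    """Mean FLOPs over memory-satisfying samples: a space whose FLOPs CDF sits higher
    under the same budget scores better and is expected to yield more accurate models."""
    subnets = [sample_subnet(space) for _ in range(n_samples)]
    flops = [estimate_flops(n) for n in subnets if estimate_peak_sram(n) <= sram_budget]
    return sum(flops) / len(flops) if flops else 0.0

example_space = {"width_mult": 0.5, "resolution": 112,
                 "block_choices": {"kernel_size": [3, 5, 7], "expansion_ratio": [3, 4, 6]}}
print(evaluate_search_space(example_space, sram_budget=320e3))
```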
TinyEngine
- Model adaptive compilation.
- interpreter-based -> generator-based compilation
- only compiles the operations that are used by a given model into the binary (see the toy code-generator sketch after this list).
- Memory Scheduling.
- Buffer planned for one layer at a time -> buffer planned across multiple layers
- Tile the computation loop nests so that as many columns as possible fit in memory.
- Specialized kernel optimization: loop tiling, inner loop unrolling, operation fusion.
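A toy illustration of the interpreter-to-code-generator shift: walk the model graph offline and emit only the C kernel calls that this specific model needs, so the firmware contains neither graph-parsing logic nor unused operators. The kernel names and graph format below are made up for illustration; this is not TinyEngine's actual generator.

```python
# Toy offline code generator: emit a C inference routine for one fixed model graph.
MODEL_GRAPH = [  # hypothetical model description
    {"op": "conv2d",    "in": "input", "out": "act0", "kernel": 3, "stride": 2},
    {"op": "depthwise", "in": "act0",  "out": "act1", "kernel": 3, "stride": 1},
    {"op": "conv2d",    "in": "act1",  "out": "act2", "kernel": 1, "stride": 1},
    {"op": "avgpool",   "in": "act2",  "out": "feat"},
    {"op": "fc",        "in": "feat",  "out": "logits"},
]

KERNEL_CALL = {  # hypothetical specialized kernels, selected per (op, config)
    "conv2d":    "conv2d_k{kernel}_s{stride}_int8({in_}, {out}, w_{idx}, b_{idx});",
    "depthwise": "dwconv_k{kernel}_s{stride}_int8({in_}, {out}, w_{idx}, b_{idx});",
    "avgpool":   "global_avgpool_int8({in_}, {out});",
    "fc":        "fully_connected_int8({in_}, {out}, w_{idx}, b_{idx});",
}

lines = ["void run_inference(const int8_t* input, int8_t* logits) {"]
for idx, layer in enumerate(MODEL_GRAPH):
    lines.append("    " + KERNEL_CALL[layer["op"]].format(
        in_=layer["in"], out=layer["out"], idx=idx,
        kernel=layer.get("kernel"), stride=layer.get("stride")))
lines.append("}")
print("\n".join(lines))  # the generated C is compiled into the firmware; no runtime interpreter
```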
Major Techniques
In TinyNAS
- Sampling + FLOPs CDF
- One-shot neural arch search [c4, c17]
- Super network training with weight sharing among sub-networks
- Evolution search
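A compressed sketch of how these three pieces fit together in one-shot NAS ([c4][c17] style): a weight-sharing super network is trained once, sub-networks are encoded as per-block choices sliced out of it, and evolution search mutates/crosses over those encodings, scoring each candidate with inherited weights instead of training from scratch. The fitness proxy and hyperparameters below are invented for illustration; in practice fitness would be validation accuracy of the weight-inherited sub-net, filtered by the memory constraint.

```python
import random

N_BLOCKS = 12
KERNELS, EXPANDS = [3, 5, 7], [3, 4, 6]

def random_subnet():
    """A sub-network = per-block (kernel, expansion) choices sliced from the supernet."""
    return [(random.choice(KERNELS), random.choice(EXPANDS)) for _ in range(N_BLOCKS)]

def mutate(net, prob=0.1):
    return [(random.choice(KERNELS), random.choice(EXPANDS)) if random.random() < prob else blk
            for blk in net]

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

def fitness(net):
    # Stand-in for: accuracy of the sub-net evaluated with weights inherited from the
    # supernet (and subject to the SRAM/Flash constraint). Purely synthetic here.
    return -sum(abs(k - 5) + abs(e - 4) for k, e in net) + random.gauss(0, 0.5)

def evolution_search(generations=20, population=64, parents=16):
    pop = [random_subnet() for _ in range(population)]
    for _ in range(generations):
        top = sorted(pop, key=fitness, reverse=True)[:parents]   # keep the best parents
        children = [mutate(random.choice(top)) for _ in range(population // 2)]
        children += [crossover(random.choice(top), random.choice(top))
                     for _ in range(population - len(children) - parents)]
        pop = top + children
    return max(pop, key=fitness)

print(evolution_search())
```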
In TinyEngine
- Code generator-based compilation
- Memory scheduling
- Loop tiling
- Inner loop unrolling
- Operation fusion
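To make the kernel-optimization terms above concrete: a tiny dependency-free Python sketch of a matrix multiply with loop tiling over the output, a 2-way unrolled inner loop, and a ReLU fused into the same pass (so no separate pre-activation buffer is needed). Real TinyEngine kernels do this in C with int8 arithmetic; this is only a conceptual illustration.

```python
def matmul_tiled_fused_relu(A, B, tile=4):
    """C = relu(A @ B) using loop tiling, 2-way inner-loop unrolling, and fused activation."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Loop tiling: iterate over small output blocks so the working set stays in fast memory.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, m)):
                    acc = 0.0
                    # Inner-loop unrolling (factor 2): fewer loop-bound checks per MAC.
                    p = 0
                    while p + 1 < k:
                        acc += A[i][p] * B[p][j] + A[i][p + 1] * B[p + 1][j]
                        p += 2
                    if p < k:                     # tail element when k is odd
                        acc += A[i][p] * B[p][j]
                    # Operation fusion: apply ReLU here instead of in a separate pass,
                    # so the pre-activation tensor never needs its own buffer.
                    C[i][j] = acc if acc > 0.0 else 0.0
    return C

A = [[1.0, -2.0, 3.0], [0.5, 0.0, -1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(matmul_tiled_fused_relu(A, B))  # [[4.0, 1.0], [0.0, 0.0]]
```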
Evaluation
- Datasets: ImageNet, Visual Wake Words (VWW), Speech Commands.
- Devices: STM32F746 MCU (320kB SRAM/1MB Flash), STM32H743 (512kB SRAM/2MB Flash). 216MHz CPU.
- YouTube demo
- 70.2% ImageNet top-1 accuracy: a record high for ImageNet recognition on microcontrollers.
More