2020 MCUNet
Q&A
- What is the search space?
- What is the mobile search space?
- ? [c42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, 2019.
- What is a model? What is the system part and the model part in the system-model co-design?
What is one-shot architecture search?
- ? [c4] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018
- ? [c17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. arXiv, 2019.
Neural Architecture Search (NAS).
- What is the position of NAS in the entire ML system?
- How many categories of search methods are there in general?
Interpreter-based inference libraries (TF-Lite Micro, CMSIS-NN)
- How is this executed in practice?
Evolution search?
Memory Scheduling?
- Who is doing the scheduling?
What is operation fusion?
Int8 linear quantization?
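A minimal numpy sketch of generic affine (linear) int8 quantization, q = round(x / scale) + zero_point, as used in TF-Lite-style int8 inference; this is the textbook formulation, not necessarily MCUNet's exact per-tensor/per-channel scheme.

```python
import numpy as np

def quantize_int8(x, qmin=-128, qmax=127):
    """Affine (linear) int8 quantization: real value r ~= scale * (q - zero_point)."""
    scale = (x.max() - x.min()) / (qmax - qmin)
    scale = max(scale, 1e-8)                      # guard against constant inputs
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

x = np.random.randn(8).astype(np.float32)
q, s, z = quantize_int8(x)
print(np.abs(dequantize_int8(q, s, z) - x).max())  # quantization error is roughly <= scale
```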
Hard-swish activation (in MobileNetV3[c23])
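For reference, hard-swish as defined in MobileNetV3 [c23] is h-swish(x) = x * ReLU6(x + 3) / 6, a piecewise-linear approximation of swish that avoids computing a sigmoid; a quick numpy sketch:

```python
import numpy as np

def relu6(x):
    return np.minimum(np.maximum(x, 0.0), 6.0)

def hard_swish(x):
    # MobileNetV3: h-swish(x) = x * ReLU6(x + 3) / 6
    return x * relu6(x + 3.0) / 6.0

print(hard_swish(np.array([-4.0, -1.0, 0.0, 1.0, 4.0])))  # ~[0, -0.33, 0, 0.67, 4]
```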
References:
- Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. MCUNet: Tiny Deep Learning on IoT Devices. In NeurIPS, 2020.
- [c6] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In ICLR, 2019.
- [c4] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018
- [c17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. arXiv, 2019.
- [c42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, 2019.
Motivation
ARM Cortex-M7 MCU: 320KB SRAM, 1MB flash storage.
- ResNet-50 exceeds the storage limit by 100x
- MobileNetV2 exceeds the peak memory limit by 22x
- int8 quantized version of MobileNetV2 still exceeds the memory limit by 5.3x
MCUs are bare-metal devices that do not have an operating system.
- Need to jointly design the deep learning model and inference library;
Limited prior literature:
- Existing efficient-network work targets mobile GPUs or smartphones, which have abundant memory and storage.
- They only optimize to reduce FLOPs or latency
- Resulting models cannot fit microcontrollers.
- Machine learning on microcontrollers
- either studies tiny-scale datasets (e.g., CIFAR or sub-CIFAR level), which are far from real-life use cases,
- or uses weak neural networks that cannot achieve decent performance.
- Examples: [c15] [c30] [c40] [c28]
- Deep learning inference on microcontrollers.
- Most frameworks rely on an interpreter to parse the network graph at runtime, which consumes extra SRAM and Flash (up to 65% of peak memory) and increases latency by 22%.
- Optimization is performed at the layer level, which fails to utilize the overall network architecture information to further reduce memory usage.
- Examples: TensorFlow Lite Micro[c3], CMSIS-NN[c27], CMix-NN[c8], MicroTVM[c9]
Efficient Neural Network Design
Two ways to improve the performance of a deep learning system:
- One is to compress the off-the-shelf networks
- Pruning
- Quantization
- Tensor decomposition
- Another is to design a new network suitable for mobile scenarios
- Neural Architecture Search (NAS) dominates efficient neural network design
The performance of NAS highly depends on the quality of the search space [c38].
- Traditionally, search space design follows manual heuristics.
- E.g., the mobile-setting search space [c6][c45]:
- Originates from MobileNetV2 [c41];
- Uses a 224 input resolution and similar base channel number configurations;
- Searches over kernel sizes, block depths, and expansion ratios;
For MCUs: there are no standard model designs or search space designs for memory-limited microcontrollers.
- It is possible to manually tweak the search space for each MCU, but this is labor-intensive;
- Need to automatically optimize the search space for tiny and diverse deployment scenarios.
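To make "optimize the search space" concrete, here is a hypothetical sketch of how a family of candidate search spaces could be enumerated by scaling the width multiplier and input resolution on top of a mobile-style (MnasNet/ProxylessNAS-like) per-block space; all knob values below are illustrative assumptions, not the grid used in the paper.

```python
import itertools

# Illustrative (assumed) knob values; each (w, r) pair defines one candidate search space.
WIDTH_MULTIPLIERS = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
RESOLUTIONS = [48, 64, 80, 96, 112, 128, 144, 160, 176]

# Mobile-style per-block choices searched *within* each space (kernel size, expansion ratio).
PER_BLOCK_CHOICES = {"kernel_size": [3, 5, 7], "expansion_ratio": [3, 4, 6]}

candidate_spaces = [
    {"width_mult": w, "resolution": r, "block_choices": PER_BLOCK_CHOICES}
    for w, r in itertools.product(WIDTH_MULTIPLIERS, RESOLUTIONS)
]
print(len(candidate_spaces), "candidate search spaces to compare via their FLOPs CDFs")
```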
Overview
MCUNet, a system-model co-design framework that enables ImageNet-scale deep learning on microcontrollers.
Efficient neural architecture – TinyNAS, with
- automated search space optimization
- resource-constrained model specialization.
Lightweight inference engine – TinyEngine
Enabling ImageNet-scale inference on microcontrollers.
SRAM (read/write) constrains the activation size; Flash (read-only) constrains the model size;
- Jointly optimize the deep learning model design (TinyNAS) and the inference library (TinyEngine) to reduce the memory usage;
- TinyNAS: two-stage neural architecture search (NAS) method that can handle the tiny and diverse memory constraints on various microcontrollers;
- Performance depends on the search space [c38], and TinyNAS explores the design heuristics at the tiny scale;
- First optimize the search space to fit the resource constraints;
- Generate different search spaces by scaling the input resolution and the model width
- Collect the computation FLOPs distribution of satisfying networks within the search space to evaluate its priority;
- Insight: a search space that can accommodate higher FLOPs under the memory constraint can produce better models
- Verified by experiments.
- Then perform the architecture search in the optimized space;
- TinyEngine:
- code generator-based compilation method to eliminate memory overhead.
- reduces memory overhead by 2.7x and improves inference speed by 22%
- model adaptive memory scheduling:
- not layer-wise optimization
- Optimize memory scheduling according to the overall network topology to get a better strategy.
- Performs specialized computation kernel optimizations (e.g., loop tiling, loop unrolling, operation fusion) for different layers;
Pushes the limit of deep network performance on microcontrollers.
- TinyEngine
- reduces peak memory usage by 2.7x (a ratio, not a subtraction: peak memory drops to roughly 1/2.7 of the baseline);
- accelerates inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN
- System-algorithm codesign: TinyNAS + TinyEngine
- ImageNet top-1 accuracy of 70.2% on a microcontroller.
- Visual & audio wake words: 2.4-3.4x faster; 2.2-2.6x smaller peak SRAM;
- Interactive applications: 10 FPS with 91% top-1 accuracy on the Speech Commands dataset.
TinyNAS
Two stage neural architecture search approach:
- Optimize the search space to fit the resource constraints
- Analyze the computation distribution
- Scale the input resolution and the width multiplier of the mobile search space.
- Select the best search space by analyzing the FLOPs CDF of the different search spaces (see the sketch below).
- Insight: the design space that is more likely to produce high FLOPs models under the memory constraint gives higher model capacity, thus more likely to achieve high accuracy.
- Specialize the network architecture in the optimized search space
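A rough sketch of the selection step above: sample many sub-networks from a candidate space, keep those that satisfy the SRAM budget, and compare how much computation the satisfying models get (the mean FLOPs of satisfying samples is used here as a cheap proxy for comparing the full FLOPs CDFs). The estimators below are crude stand-ins invented for illustration, not the paper's analytic memory/FLOPs models.

```python
import random

# Toy stand-ins for the paper's analytic estimators (hypothetical, for illustration only).
def sample_subnet(space, n_blocks=12):
    """Randomly pick per-block kernel size / expansion ratio within a candidate space."""
    c = space["block_choices"]
    return {"width_mult": space["width_mult"],
            "resolution": space["resolution"],
            "kernels": [random.choice(c["kernel_size"]) for _ in range(n_blocks)],
            "expands": [random.choice(c["expansion_ratio"]) for _ in range(n_blocks)]}

def estimate_flops(net):
    # Crude proxy: cost grows with resolution, width, kernel area, and expansion ratio.
    return net["resolution"] ** 2 * net["width_mult"] * sum(
        e * k * k for e, k in zip(net["expands"], net["kernels"]))

def estimate_peak_sram(net):
    # Crude proxy: the largest expanded activation map dominates peak SRAM.
    return net["resolution"] ** 2 * net["width_mult"] * max(net["expands"]) * 8

def evaluate_search_space(space, sram_budget, n_samples=1000):
    """Mean FLOPs over memory-satisfying samples: a space whose FLOPs CDF sits higher
    under the same budget scores better and is expected to yield more accurate models."""
    subnets = [sample_subnet(space) for _ in range(n_samples)]
    flops = [estimate_flops(n) for n in subnets if estimate_peak_sram(n) <= sram_budget]
    return sum(flops) / len(flops) if flops else 0.0

example_space = {"width_mult": 0.5, "resolution": 112,
                 "block_choices": {"kernel_size": [3, 5, 7], "expansion_ratio": [3, 4, 6]}}
print(evaluate_search_space(example_space, sram_budget=320e3))
```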
TinyEngine
- Model adaptive compilation.
- interpreter-based -> generator-based compilation
- only compiles the operations that are used by a given model into the binary (see the toy code-generator sketch after this list).
- Memory Scheduling.
- Buffer planned for one layer at a time -> buffer planned across multiple layers
- Tile the computation loop nests so that as many columns as possible fit in memory.
- Specialized kernel optimization: loop tiling, inner loop unrolling, operation fusion.
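A toy illustration of the interpreter-to-code-generator shift: walk the model graph offline and emit only the C kernel calls that this specific model needs, so the firmware contains neither graph-parsing logic nor unused operators. The kernel names and graph format below are made up for illustration; this is not TinyEngine's actual generator.

```python
# Toy offline code generator: emit a C inference routine for one fixed model graph.
MODEL_GRAPH = [  # hypothetical model description
    {"op": "conv2d",    "in": "input", "out": "act0", "kernel": 3, "stride": 2},
    {"op": "depthwise", "in": "act0",  "out": "act1", "kernel": 3, "stride": 1},
    {"op": "conv2d",    "in": "act1",  "out": "act2", "kernel": 1, "stride": 1},
    {"op": "avgpool",   "in": "act2",  "out": "feat"},
    {"op": "fc",        "in": "feat",  "out": "logits"},
]

KERNEL_CALL = {  # hypothetical specialized kernels, selected per (op, config)
    "conv2d":    "conv2d_k{kernel}_s{stride}_int8({in_}, {out}, w_{idx}, b_{idx});",
    "depthwise": "dwconv_k{kernel}_s{stride}_int8({in_}, {out}, w_{idx}, b_{idx});",
    "avgpool":   "global_avgpool_int8({in_}, {out});",
    "fc":        "fully_connected_int8({in_}, {out}, w_{idx}, b_{idx});",
}

lines = ["void run_inference(const int8_t* input, int8_t* logits) {"]
for idx, layer in enumerate(MODEL_GRAPH):
    lines.append("    " + KERNEL_CALL[layer["op"]].format(
        in_=layer["in"], out=layer["out"], idx=idx,
        kernel=layer.get("kernel"), stride=layer.get("stride")))
lines.append("}")
print("\n".join(lines))  # the generated C is compiled into the firmware; no runtime interpreter
```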
Major Techniques
In TinyNAS
- Sampling + FLOPs CDF
- One-shot neural arch search [c4, c17]
- Super network training with weight sharing among sub-networks
- Evolution search
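A compressed sketch of how these three pieces fit together in one-shot NAS ([c4][c17] style): a weight-sharing super network is trained once, sub-networks are encoded as per-block choices sliced out of it, and evolution search mutates/crosses over those encodings, scoring each candidate with inherited weights instead of training from scratch. The fitness proxy and hyperparameters below are invented for illustration; in practice fitness would be validation accuracy of the weight-inherited sub-net, filtered by the memory constraint.

```python
import random

N_BLOCKS = 12
KERNELS, EXPANDS = [3, 5, 7], [3, 4, 6]

def random_subnet():
    """A sub-network = per-block (kernel, expansion) choices sliced from the supernet."""
    return [(random.choice(KERNELS), random.choice(EXPANDS)) for _ in range(N_BLOCKS)]

def mutate(net, prob=0.1):
    return [(random.choice(KERNELS), random.choice(EXPANDS)) if random.random() < prob else blk
            for blk in net]

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]

def fitness(net):
    # Stand-in for: accuracy of the sub-net evaluated with weights inherited from the
    # supernet (and subject to the SRAM/Flash constraint). Purely synthetic here.
    return -sum(abs(k - 5) + abs(e - 4) for k, e in net) + random.gauss(0, 0.5)

def evolution_search(generations=20, population=64, parents=16):
    pop = [random_subnet() for _ in range(population)]
    for _ in range(generations):
        top = sorted(pop, key=fitness, reverse=True)[:parents]   # keep the best parents
        children = [mutate(random.choice(top)) for _ in range(population // 2)]
        children += [crossover(random.choice(top), random.choice(top))
                     for _ in range(population - len(children) - parents)]
        pop = top + children
    return max(pop, key=fitness)

print(evolution_search())
```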
In TinyEngine
- Code generator-based compilation
- Memory scheduling
- Loop tiling
- Inner loop unrolling
- Operation fusion
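To make the kernel-optimization terms above concrete: a tiny dependency-free Python sketch of a matrix multiply with loop tiling over the output, a 2-way unrolled inner loop, and a ReLU fused into the same pass (so no separate pre-activation buffer is needed). Real TinyEngine kernels do this in C with int8 arithmetic; this is only a conceptual illustration.

```python
def matmul_tiled_fused_relu(A, B, tile=4):
    """C = relu(A @ B) using loop tiling, 2-way inner-loop unrolling, and fused activation."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    # Loop tiling: iterate over small output blocks so the working set stays in fast memory.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, m)):
                    acc = 0.0
                    # Inner-loop unrolling (factor 2): fewer loop-bound checks per MAC.
                    p = 0
                    while p + 1 < k:
                        acc += A[i][p] * B[p][j] + A[i][p + 1] * B[p + 1][j]
                        p += 2
                    if p < k:                     # tail element when k is odd
                        acc += A[i][p] * B[p][j]
                    # Operation fusion: apply ReLU here instead of in a separate pass,
                    # so the pre-activation tensor never needs its own buffer.
                    C[i][j] = acc if acc > 0.0 else 0.0
    return C

A = [[1.0, -2.0, 3.0], [0.5, 0.0, -1.0]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(matmul_tiled_fused_relu(A, B))  # [[4.0, 1.0], [0.0, 0.0]]
```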
Evaluation
- Datasets: ImageNet, Visual Wake Words (VWW), Speech Commands.
- Devices: STM32F746 MCU (320kB SRAM/1MB Flash), STM32H743 (512kB SRAM/2MB Flash). 216MHz CPU.
- YouTube demo
- 70.2% ImageNet top-1 accuracy: a record high for ImageNet recognition on microcontrollers.
More