2020 MCUNet


Q&A

  • What is the search space?
  • What is mobile search space?
    • ? [c42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, 2019
  • What is a model? What are the system part and the model part in the system-model co-design?
  • What is one-shot architecture search?

    • ? [c4] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and simplifying one-shot architecture search. In ICML, 2018
    • ? [c17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. arXiv, 2019.
  • Neural Architecture Search (NAS).

    • What is the position of NAS in the entire ML system?
    • How many categories of search methods are there in general?
  • Interpreter-based inference libraries (TF-Lite Micro, CMSIS-NN)

    • How is this executed in practice?
  • Evolution search?

  • Memory Scheduling?

    • Who is doing the scheduling?
  • What is operation fusion?

  • Int8 linear quantization? (sketched in code below)

  • Hard-swish activation, from MobileNetV3[c23] (also sketched below)
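
The last two items are concrete enough that a small sketch helps. Below is a minimal NumPy sketch (my own, not from the paper) of int8 linear quantization and the hard-swish activation; real deployment pipelines calibrate scale and zero-point per tensor or per channel.

```python
import numpy as np

def int8_quantize(x):
    """Linear (affine) int8 quantization: x ~= scale * (q - zero_point)."""
    qmin, qmax = -128, 127
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def int8_dequantize(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

def hard_swish(x):
    """Hard-swish from MobileNetV3: x * ReLU6(x + 3) / 6."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

x = np.linspace(-4, 4, 9).astype(np.float32)
q, s, z = int8_quantize(x)
print(int8_dequantize(q, s, z))  # close to x, up to quantization error
print(hard_swish(x))             # gating like swish, but cheap on integer HW
```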


References:

  • Ji Lin, Wei-Ming Chen, Yujun Lin, John Cohn, Chuang Gan, and Song Han. MCUNet: Tiny Deep Learning on IoT Devices. In NeurIPS, 2020.
  • [c6] Han Cai, Ligeng Zhu, and Song Han. ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware. In ICLR, 2019.
  • [c4] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le. Understanding and Simplifying One-Shot Architecture Search. In ICML, 2018.
  • [c17] Zichao Guo, Xiangyu Zhang, Haoyuan Mu, Wen Heng, Zechun Liu, Yichen Wei, and Jian Sun. Single Path One-Shot Neural Architecture Search with Uniform Sampling. arXiv, 2019.
  • [c42] Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, 2019.

Motivation

ARM Cortex-M7 MCU (e.g., STM32F746): 320 kB SRAM, 1 MB flash storage.

  • ResNet-50 exceeds the flash storage limit by ~100x (quick check below)
  • MobileNetV2 exceeds the peak memory (SRAM) limit by 22x
  • Even the int8-quantized MobileNetV2 still exceeds the memory limit by 5.3x
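
A quick back-of-the-envelope check of the storage claim (a sketch; assumes ResNet-50's roughly 25.6M parameters stored as fp32, ignoring metadata):

```python
# Rough sanity check of the "exceeds storage by 100x" claim.
resnet50_params = 25.6e6               # ~25.6M parameters (approximate)
fp32_model_bytes = resnet50_params * 4  # 4 bytes per fp32 weight
flash_bytes = 1e6                      # 1 MB flash on an STM32F746-class MCU
print(fp32_model_bytes / flash_bytes)  # ~102, i.e. ~100x over budget
```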

MCUs are bare-metal devices that do not have an operating system.

  • Need to jointly design the deep learning model and inference library;

Limited literature:

  • Work targeting mobile GPUs or smartphones assumes abundant memory and storage.
    • It only optimizes to reduce FLOPs or latency.
    • The resulting models cannot fit on microcontrollers.
  • Machine learning on microcontrollers
    • either studies tiny-scale datasets (e.g., CIFAR or sub-CIFAR level), which are far from real-life use cases,
    • or uses weak neural networks that cannot achieve decent performance.
    • Examples: [c15] [c30] [c40] [c28]
  • Deep learning inference on microcontrollers.
    • Most frameworks rely on an interpreter to interpret the network graph at runtime, which consumes a lot of SRAM and flash (up to 65% of peak memory) and increases latency by 22%.
    • The optimization is performed at the layer level, which fails to utilize the overall network architecture information to further reduce memory usage.
    • Examples: TensorFlow Lite Micro[c3], CMSIS-NN[c27], CMix-NN[c8], MicroTVM[c9]

Efficient Neural Network Design

  • Two ways to improve the performance of deep learning system:

    • One is to compress off-the-shelf networks:
      • Pruning
      • Quantization
      • Tensor decomposition
    • Another is to design a new network suitable for mobile scenarios:
      • Neural Architecture Search (NAS) dominates efficient neural network design.
  • The performance of NAS highly depends on the quality of the search space[^c38].

    • Traditionally, one needs to follow manual design heuristics for NAS search space design.
    • E.g., the mobile-setting search space[^1] [^c6] [^c45]:
      • Originates from MobileNetV2[^c41];
      • Uses a 224x224 input resolution and a similar base channel number configuration;
      • Searches for kernel sizes, block depths, and expansion ratios.
  • For MCUs: there are no standard model designs or search space designs for MCUs with such limited memory.

    • It is possible to manually tweak the search space for each MCU, but that is labor-intensive;
    • Need to automatically optimize the search space for tiny and diverse deployment scenarios.

Overview

MCUNet, a system-model co-design framework that enables ImageNet-scale deep learning on microcontrollers.

  • Efficient neural architecture – TinyNAS, with

    • automated search space optimization
    • resource-constrained model specialization.
  • Lightweight inference engine – TinyEngine

  • Enabling ImageNet-scale inference on microcontrollers.

SRAM (read/write) constrains the activation size; flash (read-only) constrains the model size.

  • Jointly optimize the deep learning model design (TinyNAS) and the inference library (TinyEngine) to reduce the memory usage;
  • TinyNAS: two-stage neural architecture search (NAS) method that can handle the tiny and diverse memory constraints on various microcontrollers;
    • Performance depends on search space[^c38], and TinyNAS explores the design heuristic at the tiny scale
    • First optimize the search space to fit the resource constraints;
      • Generate different search spaces by scaling the input resolution and the model width
      • Collect the FLOPs distribution of the constraint-satisfying networks within each search space to evaluate its priority;
      • Insight: a search space that can accommodate higher FLOPs under memory constraint can produce better model
      • Experiments verified this insight.
    • Then perform the search in the optimized space;
  • TinyEngine:
    • code generator-based compilation method to eliminate interpreter memory overhead (toy sketch after this list).
      • reduces memory overhead by 2.7x and improves inference speed by 22%
    • model adaptive memory scheduling:
      • not layer-wise optimization
      • Optimize memory scheduling according to the overall network topology to get a better strategy.
    • Perform specialized computation kernel optimizations (e.g., loop tiling, loop unrolling, operation fusion) for different layers;
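
A toy illustration of the generator-based idea (my sketch, not TinyEngine's actual code generator): instead of shipping an interpreter plus every possible kernel, emit a C source that calls only the kernels this specific model needs, with shapes baked in as compile-time constants.

```python
# Hypothetical mini code generator. Layer names, kernel naming scheme,
# and buffer handling are invented for illustration only.
model = [
    ("conv2d", {"cin": 3, "cout": 16, "k": 3}),
    ("hard_swish", {}),
    ("conv2d", {"cin": 16, "cout": 32, "k": 3}),
]

def emit_c(model):
    lines = ["void run_inference(int8_t *buf) {"]
    for i, (op, attrs) in enumerate(model):
        # Shapes become constants in the emitted call, so the binary
        # contains no graph interpreter and no unused kernels.
        args = "".join(f", {v}" for v in attrs.values())
        lines.append(f"    {op}_k{i}(buf{args});")
    lines.append("}")
    return "\n".join(lines)

print(emit_c(model))
```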

Pushes the limit of deep network performance on microcontrollers.

  • TinyEngine
    • reduces memory usage by 2.7x (i.e., peak memory drops to roughly 1/2.7 of the baseline, not a negative amount);
    • accelerates the inference by 1.7-3.3x compared to TF-Lite Micro and CMSIS-NN.
  • System-algorithm co-design: TinyNAS + TinyEngine
    • ImageNet top-1 accuracy of 70.2% on a microcontroller.
    • Visual & audio wake words: 2.4-3.4x faster; 2.2-2.6x smaller peak SRAM.
    • Interactive applications: 10 FPS with 91% top-1 accuracy on the Speech Commands dataset.

TinyNAS

Two-stage neural architecture search approach:

  • Optimize the search space to fit the resource constraints
    • Analyze the computation distribution.
    • Scale the input resolution and the width multiplier of the mobile search space[^1].
    • Select the best search space by analyzing the FLOPs CDFs of different search spaces (see the sketch after this list).
    • Insight: the design space that is more likely to produce high-FLOPs models under the memory constraint gives higher model capacity, and is thus more likely to achieve high accuracy.
  • Specialize the network architecture in the optimized search space
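
A minimal sketch of the search-space selection step, with invented stand-in profilers (the paper samples real sub-networks from each candidate space and compares their FLOPs CDFs under the memory constraint):

```python
import random

# Stand-in profilers, invented for illustration; a real run would measure
# FLOPs and peak activation memory of sampled sub-networks.
def sample_config(space):
    res, width = space
    return {"res": res, "width": width, "depth": random.choice([2, 3, 4])}

def flops_of(cfg):
    return cfg["res"] ** 2 * cfg["width"] * cfg["depth"] * 1e3

def peak_memory_of(cfg):                     # bytes; crude activation proxy
    return cfg["res"] ** 2 * cfg["width"] * 64

def pick_search_space(spaces, sram_limit, m=1000):
    """Score each (resolution, width) space by a high percentile of the
    FLOPs CDF over sub-networks that fit in SRAM; return the best space."""
    best, best_score = None, -1.0
    for space in spaces:
        samples = [sample_config(space) for _ in range(m)]
        feasible = [flops_of(c) for c in samples
                    if peak_memory_of(c) <= sram_limit]
        if not feasible:
            continue                         # space never fits; discard it
        score = sorted(feasible)[int(0.8 * (len(feasible) - 1))]
        if score > best_score:
            best, best_score = space, score
    return best

print(pick_search_space([(96, 0.35), (112, 0.5), (176, 0.75)], sram_limit=320e3))
```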

TinyEngine

  • Model-adaptive compilation.
    • interpreter-based execution -> code generator-based compilation
    • Only compiles the operations that are used by a given model into the binary.
  • Memory scheduling (a generic planner is sketched after this list).
    • one layer per buffer -> multiple layers sharing the buffer
    • Tile the computation loop nests so that as many columns as possible fit in that memory.
  • Specialized kernel optimization: loop tiling, inner loop unrolling, operation fusion.
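
As a sketch of the memory-scheduling idea, here is a generic lifetime-based buffer planner (my own simplification; TinyEngine's scheduler additionally adapts the im2col tiling to the whole network topology). Activations whose lifetimes do not overlap can share the same address range, which lowers peak SRAM below the sum of all buffer sizes.

```python
def plan_memory(tensors):
    """Greedy offset assignment. tensors: [(name, size, first_use, last_use)].
    Tensors with non-overlapping lifetimes may reuse the same offsets."""
    placed = []                               # (offset, size, first, last)
    plan = {}
    for name, size, first, last in sorted(tensors, key=lambda t: -t[1]):
        offset = 0
        for off, sz, f, l in sorted(placed):  # scan conflicts left to right
            lifetime_overlap = not (last < f or l < first)
            space_overlap = offset < off + sz and off < offset + size
            if lifetime_overlap and space_overlap:
                offset = off + sz             # bump past the conflicting block
        placed.append((offset, size, first, last))
        plan[name] = offset
    peak = max(off + sz for off, sz, _, _ in placed)
    return plan, peak

# Layer i's input dies after layer i runs, so act0 and act2 can share space.
tensors = [("act0", 96, 0, 1), ("act1", 64, 1, 2), ("act2", 96, 2, 3)]
print(plan_memory(tensors))   # peak 160 < 256 (the sum of all sizes)
```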

Major Techniques

In TinyNAS

  • Sampling + FLOPs CDF
  • One-shot neural architecture search [c4, c17]
  • Supernet training with weight sharing among sub-networks
  • Evolution search (sketched below)
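
A minimal sketch of evolution search under memory constraints, assuming a hypothetical accuracy_of(cfg) that evaluates a sub-network with weights inherited from the trained supernet, and a fits(cfg) SRAM/flash feasibility check:

```python
import random

def mutate(cfg):
    new = dict(cfg)
    key = random.choice(list(new))             # perturb one architectural choice
    new[key] = random.choice({"kernel": [3, 5, 7], "depth": [2, 3, 4]}[key])
    return new

def crossover(a, b):
    return {k: random.choice([a[k], b[k]]) for k in a}

def evolution_search(population, accuracy_of, fits, rounds=20, pop_size=50):
    pop = [c for c in population if fits(c)]
    for _ in range(rounds):
        parents = sorted(pop, key=accuracy_of, reverse=True)[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            child = (mutate(random.choice(parents)) if random.random() < 0.5
                     else crossover(*random.sample(parents, 2)))
            if fits(child):                    # keep only memory-feasible children
                children.append(child)
        pop = parents + children
    return max(pop, key=accuracy_of)

# Dummy usage with stand-in evaluators (a real run queries the supernet).
pop0 = [{"kernel": random.choice([3, 5, 7]), "depth": random.choice([2, 3, 4])}
        for _ in range(50)]
best = evolution_search(pop0, accuracy_of=lambda c: c["kernel"] * c["depth"],
                        fits=lambda c: c["kernel"] * c["depth"] <= 21)
print(best)  # highest-"accuracy" config that still fits the budget
```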

In TinyEngine

  • Code generator-based compilation
  • Memory scheduling
  • Loop tiling
  • Inner loop unrolling
  • Operation fusion (all three kernel optimizations are sketched below)
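
The three kernel optimizations are easiest to see on a plain matrix multiply. The sketch below is Python for readability; TinyEngine emits the equivalent specialized C, where tiling keeps the working set in SRAM and unrolling removes loop/branch overhead.

```python
def matmul_relu_fused(A, B, C, n, tile=4):
    """C = ReLU(A @ B), showing loop tiling, inner-loop unrolling (factor 2),
    and operation fusion (activation applied per tile, no second pass over C)."""
    for i0 in range(0, n, tile):              # loop tiling: work on a small
        for j0 in range(0, n, tile):          # block of C that stays "hot"
            i_end, j_end = min(i0 + tile, n), min(j0 + tile, n)
            for k in range(n):
                for i in range(i0, i_end):
                    a = A[i][k]
                    j = j0
                    while j + 1 < j_end:      # inner loop unrolled by 2
                        C[i][j] += a * B[k][j]
                        C[i][j + 1] += a * B[k][j + 1]
                        j += 2
                    while j < j_end:          # remainder iteration
                        C[i][j] += a * B[k][j]
                        j += 1
            for i in range(i0, i_end):        # operation fusion: ReLU runs on
                for j in range(j0, j_end):    # the finished tile instead of a
                    C[i][j] = max(C[i][j], 0.0)  # separate pass over all of C

n = 6
A = [[1.0] * n for _ in range(n)]
B = [[-0.5] * n for _ in range(n)]
C = [[0.0] * n for _ in range(n)]
matmul_relu_fused(A, B, C, n)
print(C[0])  # all zeros: ReLU(6 * 1.0 * -0.5) = 0.0
```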

Evaluation

  • Datasets: ImageNet, Visual Wake Words (VWW), Speech Commands.
  • Devices: STM32F746 MCU (320 kB SRAM / 1 MB flash, 216 MHz Cortex-M7), STM32H743 (512 kB SRAM / 2 MB flash).
  • YouTube demo.
  • 70.2% top-1 ImageNet accuracy, a record high for microcontrollers.

More


[^1]: Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V. Le. MnasNet: Platform-Aware Neural Architecture Search for Mobile. In CVPR, 2019.