Once-for-all: Train one network and specialize it for efficient deployment


Q&A

  • What are MACs?

  • Resolution?

  • Kernel?

  • Depth?

  • Width? Channel?

    • L1 norm of channel’s weight?
  • Problem formalization

    • $\min_{W_o} \sum_{a_i} L_v(C(W_o, a_i))$, where $W_o$ are the weights of the once-for-all super network, $a_i$ is an architecture configuration, $C(W_o, a_i)$ selects the sub-network with configuration $a_i$ from $W_o$, and $L_v$ is the validation loss.

References:

The future will be populated with many AI-capable IoT devices. AI will surround our lives at much lower cost, lower latency, and higher accuracy. More powerful AI applications will run on tiny edge devices, which requires extremely compact models and efficient chips. At the same time, privacy will become increasingly important, and on-device AI will grow popular thanks to its privacy and latency advantages. Model compression and efficient architecture design techniques will enable on-device AI and make it more capable.

ICLR’20: Once-for-all: Train one network and specialize it for efficient deployment.

  • Decouples model training from neural architecture search.
    • Train once: train a super network.
    • Search for multiple targets: each target is a progressively shrunk sub-network of the trained super network (a selection sketch follows this list).
    • Opens up opportunities for algorithm and hardware co-design.
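To make "train once, specialize many times" concrete, here is a minimal sketch of pulling a specialized sub-network out of the trained super network. It assumes the selection API shown in the OFA repo's README (ofa_net, set_active_subnet, get_active_subnet); exact argument names may differ across versions.

```python
# Sketch: specialize the trained OFA super network without any retraining.
# Assumes the API from the mit-han-lab/once-for-all README; exact
# signatures may vary across versions of the package.
import torch
from ofa.model_zoo import ofa_net

# Load the trained once-for-all super network W_o (pretrained weights).
super_net = ofa_net('ofa_mbv3_d234_e346_k357_w1.0', pretrained=True)

# Pick an architecture configuration a_i (kernel size, expand ratio, depth)
# and select the corresponding sub-network C(W_o, a_i).
super_net.set_active_subnet(ks=7, e=6, d=4)
subnet = super_net.get_active_subnet(preserve_weight=True)

# The sub-network inherits the shared weights and can be deployed directly;
# the fourth elastic dimension, input resolution, is chosen at inference time.
subnet.eval()
with torch.no_grad():
    logits = subnet(torch.randn(1, 3, 224, 224))
print(logits.shape)  # 1000 ImageNet classes
```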

Key Techniques

Progressive Shrinking

  • Shrinks along four dimensions (resolution, kernel size, depth, width) instead of only one.
  • Kernel size: introduce kernel transformation matrices (first sketch after this list).
    • Kernel weights are shared: a smaller kernel reuses the center of the larger kernel and is passed through a transformation matrix;
    • Separate transformation matrices are used for different layers;
  • Depth: from the original N layers, keep the first D layers and skip the remaining N - D layers;
  • Width: from the original number of channels, sort channels by importance (second sketch after this list).
    • Importance of a channel: the L1 norm of the channel's weights;
    • A larger L1 norm means the channel is more important;
    • Smaller sub-networks are initialized with the most important channels.
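First, a minimal sketch of the elastic-kernel idea under stated assumptions: the smaller kernel reuses the center of the larger kernel and is passed through a per-layer transformation matrix. Class and parameter names (ElasticKernelConv, transform) are illustrative, not the paper's or repo's exact implementation, and only a single 7-to-5 step is shown.

```python
# Sketch: elastic kernel size via a shared kernel and a transformation matrix.
# Names and shapes here are illustrative assumptions, not the OFA code;
# the paper chains 7x7 -> 5x5 -> 3x3, while this shows a single step.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticKernelConv(nn.Module):
    def __init__(self, channels, max_ks=7, small_ks=5):
        super().__init__()
        self.max_ks = max_ks
        # Full 7x7 depthwise kernel; smaller kernels reuse its center weights.
        self.weight = nn.Parameter(torch.randn(channels, 1, max_ks, max_ks))
        # Per-layer transformation matrix (25x25 for a 5x5 kernel),
        # shared across channels within the layer.
        self.transform = nn.Parameter(torch.eye(small_ks * small_ks))

    def forward(self, x, active_ks=7):
        if active_ks == self.max_ks:
            w = self.weight
        else:
            start = (self.max_ks - active_ks) // 2
            center = self.weight[:, :, start:start + active_ks, start:start + active_ks]
            flat = center.reshape(center.size(0), -1)        # (C, ks*ks)
            w = (flat @ self.transform).reshape_as(center)   # transformed small kernel
        return F.conv2d(x, w, padding=active_ks // 2, groups=x.size(1))

# Example: the same layer serves both kernel sizes with one set of weights.
layer = ElasticKernelConv(channels=32)
x = torch.randn(1, 32, 56, 56)
y7 = layer(x, active_ks=7)
y5 = layer(x, active_ks=5)
```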
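Second, a sketch of the channel-sorting step: rank output channels by the L1 norm of their weights so that narrower sub-networks keep the most important ones. The helper name is hypothetical, not a function from the repo.

```python
# Sketch: channel importance = L1 norm of the channel's weights.
# sort_channels_by_importance is a hypothetical helper, not OFA repo code.
import torch

def sort_channels_by_importance(conv_weight: torch.Tensor) -> torch.Tensor:
    """conv_weight has shape (out_channels, in_channels, kH, kW); returns
    output-channel indices ordered from most to least important."""
    importance = conv_weight.abs().sum(dim=(1, 2, 3))  # per-channel L1 norm
    return torch.argsort(importance, descending=True)

# Example: a narrower sub-network keeps only the top half of the channels.
w = torch.randn(64, 32, 3, 3)
order = sort_channels_by_importance(w)
narrow_w = w[order[:32]]  # weights of the 32 most important output channels
```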

Evaluation

  • ImageNet;
  • Samsung S7 Edge, Note10, Google Pixel1, Pixel2, LG G8, NVIDIA 1080 Ti and V100 GPUs, Jetson TX2, Intel Xeon CPU, Xilinx ZU9EG and ZU3EG FPGAs.
  • Cloud Devices:
    • GPU: NVIDIA 1080 Ti and V100 with PyTorch 1.0 + cuDNN.
    • CPU: Intel Xeon E5-2690 v4 with MKL-DNN, batch size 1.
  • Edge Devices:
    • Mobile phones: Samsung, Google, and LG phones with TF-Lite, batch size 1;
    • Mobile GPU: Jetson TX2 with PyTorch 1.0 + cuDNN, batch size 16;
    • Embedded FPGA: Xilinx ZU9EG and ZU3EG FPGAs with Vitis AI, batch size 1 (inference acceleration).

More

  • Training
  • Q&A: ofa_net is always called with pretrained=True, which means it works without training. But how is the super network trained in the first place? (A hedged training sketch follows after the snippet below.)

    References:

        # file: ofa/model_zoo.py
        def ofa_net(net_id, pretrained=True):
            if net_id == 'ofa_proxyless_d234_e346_k357_w1.3':
                net = OFAProxylessNASNets(
                    dropout_rate=0,
                    width_mult=1.3,
                    ks_list=[3, 5, 7],
                    expand_ratio_list=[3, 4, 6],
                    depth_list=[2, 3, 4],
                )
            elif net_id == 'ofa_mbv3_d234_e346_k357_w1.0':
                net = OFAMobileNetV3(
                    dropout_rate=0,
                    width_mult=1.0,
                    ks_list=[3, 5, 7],
                    expand_ratio_list=[3, 4, 6],
                    depth_list=[2, 3, 4],
                )
            elif net_id == 'ofa_mbv3_d234_e346_k357_w1.
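To answer the Q&A above: the super network has to be trained once with progressive shrinking before pretrained=True makes sense. Below is a minimal sketch of one training stage, assuming the repo's dynamic-network sampling API (sample_active_subnet); the real pipeline runs several stages (full network, then elastic kernel, depth, and width) with knowledge distillation from the largest network, so this loop is illustrative only.

```python
# Sketch: one stage of progressive-shrinking training for the super network.
# sample_active_subnet() follows the OFA repo's dynamic-network API as assumed
# here; treat this loop as an illustration, not the actual training script.
import torch
import torch.nn.functional as F

def train_one_stage(super_net, teacher_net, loader, optimizer, device='cuda'):
    super_net.train()
    teacher_net.eval()  # the largest network acts as the distillation teacher
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        # Randomly pick a sub-network config (kernel size / depth / expand
        # ratio) from the ranges allowed in this stage; the forward pass then
        # uses only the selected slice of the shared weights.
        super_net.sample_active_subnet()
        logits = super_net(images)
        loss = F.cross_entropy(logits, labels)
        # Soft-label distillation from the full network stabilizes the
        # shrunk sub-networks.
        with torch.no_grad():
            soft_targets = F.softmax(teacher_net(images), dim=1)
        loss = loss + F.kl_div(F.log_softmax(logits, dim=1),
                               soft_targets, reduction='batchmean')
        loss.backward()
        optimizer.step()
```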

  • Tutorial
  • References: Hands-on Tutorial of Once-for-All Network, see tutorial/ofa.ipynb ("How to Get Your Specialized Neural Networks on ImageNet in Minutes With OFA Networks"). The notebook demonstrates:
    • how to use pretrained specialized OFA sub-networks for efficient inference on diverse hardware platforms;
    • how to get new specialized neural networks on ImageNet with the OFA network within minutes.
    Once-for-All (OFA) is an efficient AutoML technique that decouples training from search.
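To illustrate the "minutes, not GPU-hours" search side of that decoupling, here is a sketch of constraint-aware search over sub-network configurations. predict_accuracy and estimate_latency are hypothetical placeholders standing in for the tutorial's accuracy predictor and latency lookup table, and plain random search stands in for its evolutionary search.

```python
# Sketch: specialize without retraining by searching over sub-network configs.
# predict_accuracy() and estimate_latency() are hypothetical placeholders for
# the tutorial's accuracy predictor and latency table; random search replaces
# its evolutionary search for brevity.
import random

KS_CHOICES = [3, 5, 7]        # elastic kernel sizes
EXPAND_CHOICES = [3, 4, 6]    # elastic expand ratios
DEPTH_CHOICES = [2, 3, 4]     # elastic depths per stage
NUM_STAGES = 5                # assumed number of stages in the backbone

def random_config():
    n_blocks = NUM_STAGES * max(DEPTH_CHOICES)
    return {
        'ks': [random.choice(KS_CHOICES) for _ in range(n_blocks)],
        'e':  [random.choice(EXPAND_CHOICES) for _ in range(n_blocks)],
        'd':  [random.choice(DEPTH_CHOICES) for _ in range(NUM_STAGES)],
    }

def search(latency_budget_ms, predict_accuracy, estimate_latency, n_trials=1000):
    best_cfg, best_acc = None, 0.0
    for _ in range(n_trials):
        cfg = random_config()
        if estimate_latency(cfg) > latency_budget_ms:
            continue                    # violates the hardware constraint
        acc = predict_accuracy(cfg)     # prediction only, no training
        if acc > best_acc:
            best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```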

Created Oct 21, 2020 // Last Updated Aug 31, 2021
