Efficient AI with Tiny Resources

References:

Evaluations

NeurIPS’20: MCUNet: Tiny Deep Learning on IoT Devices

  • Datasets: ImageNet, Visual Wake Words (VWW), Speech Commands.
  • Devices: STM32F746 MCU (320 kB SRAM / 1 MB Flash, 216 MHz) and STM32H743 MCU (512 kB SRAM / 2 MB Flash); see the memory-budget sketch below.
  • YouTube Demo
  • Record-high 70.2% ImageNet accuracy on microcontrollers.
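
A back-of-the-envelope way to read those memory limits: an int8-quantized model has to fit its weights in flash and its peak activation map in SRAM. The sketch below is my own illustration of that budget check in Python (the parameter and activation counts are made-up numbers, not MCUNet's):

```python
# Rough memory-budget check for MCU deployment (illustrative only; the
# constants match the STM32F746 figures above, the model numbers are
# hypothetical). An int8 model must fit its weights in flash and its
# peak activation map in SRAM.
FLASH_BYTES = 1 * 1024 * 1024   # 1 MB flash for weights
SRAM_BYTES = 320 * 1024         # 320 kB SRAM for activations

def fits_mcu(num_params, peak_activation_elems, bytes_per_value=1):
    """Return True if an int8-quantized model fits the MCU budget."""
    weight_bytes = num_params * bytes_per_value
    act_bytes = peak_activation_elems * bytes_per_value
    return weight_bytes <= FLASH_BYTES and act_bytes <= SRAM_BYTES

# Hypothetical model: ~0.7 M parameters, 250k-element peak activation map.
print(fits_mcu(num_params=700_000, peak_activation_elems=250_000))  # True
```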

NeurIPS’20: Tiny Transfer Learning: Towards Memory-Efficient On-Device Learning

  • Three benchmark datasets: Cars, Flowers, Aircraft

    • Using ImageNet as the pre-training dataset.
    • Neural network architecture: MobileNetV2 (lightweight), ResNet-50.
  • Devices: Raspberry Pi 1 with 256 MB of memory (a bias-only fine-tuning sketch follows below).
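
One way to picture the memory saving behind this paper (my own reading, not the authors' code): freeze the pretrained weights and update only biases and the classifier head, so intermediate activations need not be stored for weight gradients. A minimal PyTorch sketch, assuming torchvision's MobileNetV2 and a hypothetical 102-class target task:

```python
import torch
import torchvision

# In practice the backbone would be loaded with ImageNet-pretrained weights.
model = torchvision.models.mobilenet_v2()
model.classifier[1] = torch.nn.Linear(model.last_channel, 102)  # new task head

for name, p in model.named_parameters():
    # Train only biases and the new classifier head; freeze all weights.
    p.requires_grad = name.endswith(".bias") or name.startswith("classifier")

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01, momentum=0.9
)
```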

NeurIPS’20: Differentiable Augmentation for Data-Efficient GAN Training

  • Model: BigGAN, CR-BigGAN, StyleGAN2
  • Datasets: ImageNet (128x128 resolution), FFHQ portrait dataset (256x256), and low-shot (few-shot) generation datasets.
  • Devices: not specified in the paper (general GPU platform?); a minimal sketch of the augmentation idea follows below.
  • Code: Data-efficient-gans
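
The core trick, as I understand it, is to apply the same differentiable transform to both real and generated images in both the discriminator and generator updates, so gradients flow back to the generator through the augmentation. A minimal sketch with a toy brightness-shift policy and hinge losses (the paper's actual policy combines color, translation, and cutout):

```python
import torch

def diff_augment(x):
    # Toy policy: per-sample random brightness shift, differentiable w.r.t. x.
    return x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)

def d_loss(D, G, real, z):
    # Augment both real and fake images before the discriminator (hinge loss).
    fake = G(z).detach()
    return (torch.relu(1 - D(diff_augment(real))).mean()
            + torch.relu(1 + D(diff_augment(fake))).mean())

def g_loss(D, G, z):
    # The same augmentation is kept in the generator update.
    return -D(diff_augment(G(z))).mean()
```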

ICLR’20: Once-for-All: Train One Network and Specialize It for Efficient Deployment.

  • Dataset: ImageNet.
  • Devices: Samsung S7 Edge, Note10, Google Pixel 1, Pixel 2, LG G8 phones; NVIDIA 1080 Ti and V100 GPUs; Jetson TX2; Intel Xeon CPU; Xilinx ZU9EG and ZU3EG FPGAs.
  • Cloud Devices:
    • GPU: NVIDIA 1080 Ti and V100 with PyTorch 1.0 + cuDNN (see the latency-measurement sketch after this list).
    • CPU: Intel Xeon E5-2690 v4 with MKL-DNN, batch size 1.
  • Edge Devices:
    • Mobile phones: Samsung, Google and LG phones with TF-Lite, batch size 1;
    • Mobile GPU: Jetson TX2 with PyTorch 1.0 + cuDNN, batch size 16;
    • Embedded FPGA: Xilinx ZU9EG and ZU3EG FPGAs with Vitis AI, batch size 1 (inference acceleration).
    • Xilinx ZU9EG (ZCU102, $2495)
    • Xilinx ZU3EG
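
For the cloud-GPU rows above, a rough batch-size-1 latency measurement in PyTorch might look like the sketch below (my own harness with a stand-in torchvision model, not the OFA measurement code; it needs a CUDA-capable GPU):

```python
import time
import torch
import torchvision

model = torchvision.models.mobilenet_v2().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")   # batch size 1

with torch.no_grad():
    for _ in range(50):                # warm-up so cuDNN autotuning settles
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()

print(f"avg latency: {(time.time() - start) / 100 * 1000:.2f} ms")
```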

ECCV’20: Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution.

ECCV’20: DataMix: Efficient Privacy-Preserving Edge-Cloud Inference.

ACL’20: HAT: Hardware-Aware Transformers for Efficient Natural Language Processing.

CVPR’20: GAN Compression: Efficient Architectures for Interactive Conditional GANs

  • Three GAN (Generative Adversarial Network) Models
    • CycleGAN
    • Pix2pix
    • GauGAN
  • Four datasets:

    • Edges->shoes. Used in pix2pix.
    • Cityscapes. Used in pix2pix and GauGAN.
    • Horse <-> zebra. Used in CycleGAN.
    • Map <-> aerial photo. Used in pix2pix.
  • Devices: CPU or NVIDIA GPU with CUDA + cuDNN.

CVPR’20: APQ: Joint Search for Network Architecture, Pruning and Quantization Policy.

DAC’20: GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning.

HPCA’20: SpArch: Efficient Architecture for Sparse Matrix Multiplication.

ICLR’20: Lite Transformer with Long Short Term Attention.

NeurIPS’19: Point Voxel CNN for Efficient 3D Deep Learning.

NeurIPS’19: Deep Leakage from Gradients.

ICCV’19: TSM: Temporal Shift Module for Efficient Video Understanding.

CVPR’19: HAQ: Hardware-Aware Automated Quantization with Mixed Precision.

ICLR’19: ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware.

ICLR’19: Defensive Quantization: When Efficiency Meets Robustness.

ECCV’18: AMC: AutoML for Model Compression and Acceleration on Mobile Devices.


ICLR’18: Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training.

ICLR’18: Efficient Sparse-winograd Convolutional Neural Networks.

ICLR’17: DSD: Dense-Sparse-Dense Training for Deep Neural Networks.

ICLR’17: Trained Ternary Quantization.

IEEE Micro’17 (Hot Chips): Software-Hardware Co-Design for Efficient Neural Network Acceleration.

  • Devices: FPGA
    • Xilinx XCKU060, 200MHz
    • CPU: Intel Core i7-5930K
    • GPU: Pascal Titan X.

EMDNN’16 / FPGA’17: ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA.

  • Devices: FPGA
    • Xilinx XCKU060, 200MHz
    • Intel Core i7 5930K
    • Pascal Titan X GPU

O’Reilly, 2016: Compressing and Regularizing Deep Neural Networks: Improving Prediction Accuracy Using Deep Compression and DSD Training.


ISCA’16: EIE: Efficient Inference Engine on Compressed Deep Neural Network.

  • Models: AlexNet and VGGNet (ImageNet), NeuralTalk; Caffe framework. (See the sparse matrix-vector sketch after this list.)
  • Devices:
    • Cycle-accurate simulator implemented in C++.
    • Also implemented in RTL (Verilog) to measure area, power, and critical path delay.
    • Synopsys Design Compiler (DC) used under the TSMC 45nm GP standard VT library.
    • Cacti used to get SRAM area and energy numbers.
    • Baselines: CPU, GPU, and mobile GPU.
    • Intel Core i7-5930K CPU
    • Nvidia GeForce GTX Titan X GPU
    • Nvidia Tegra K1 mobile GPU (192 CUDA cores)
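
The computation such an engine accelerates is a sparse matrix-vector product over compressed weights that also skips zero activations. A plain-Python illustration of that inner loop (not the hardware or the paper's code):

```python
def sparse_matvec(values, col_idx, row_ptr, x):
    """CSR-encoded weights (values / col_idx / row_ptr) times activation x."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            a = x[col_idx[k]]
            if a != 0.0:               # skip zero activations (dynamic sparsity)
                y[row] += values[k] * a
    return y

# Tiny example: [[0, 2, 0], [1, 0, 3]] times x = [4, 0, 5] -> [0, 19].
print(sparse_matvec([2.0, 1.0, 3.0], [1, 0, 2], [0, 1, 3], [4.0, 0.0, 5.0]))
```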

ICLR’16: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.

  • Model/Platform: Caffe framework;
    • LeNet-300-100, LeNet-5 on MNIST
    • AlexNet, VGG-16 on ImageNet
  • Devices: CPU, GPU, mobile GPU.
    • Nvidia GeForce GTX Titan X
    • Intel Core i7 5930K as desktop processor.
    • Nvidia Tegra K1 as mobile processor.
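
As a reminder of what the pipeline in the title does after pruning, here is a small NumPy sketch of the weight-sharing step: cluster the surviving weights of a layer into 2^b shared values with a toy k-means (simplified from the paper's per-layer codebooks; the pruning threshold below is arbitrary):

```python
import numpy as np

def share_weights(w, bits=4, iters=10):
    """Cluster nonzero weights into 2**bits shared centroids (toy k-means)."""
    nz = w[w != 0]
    centroids = np.linspace(nz.min(), nz.max(), 2 ** bits)      # linear init
    for _ in range(iters):
        assign = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(len(centroids)):
            if np.any(assign == c):
                centroids[c] = nz[assign == c].mean()
    q = w.copy()
    assign = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
    q[w != 0] = centroids[assign]      # zeros stay pruned; rest share values
    return q, centroids

w = np.random.randn(256, 256)
w[np.abs(w) < 0.5] = 0.0                # magnitude pruning (arbitrary threshold)
q, codebook = share_weights(w, bits=4)  # 16 shared values + a sparse mask
```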

NIPS’15: Learning both Weights and Connections for Efficient Neural Networks.

  • Model/Platform: Caffe; Four networks:
    • LeNet-300-100, LeNet-5 on MNIST
    • AlexNet and VGG on ImageNet.
  • Devices:
    • Nvidia Titan X
    • GTX 980 GPUs.
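
A minimal PyTorch sketch of the prune-and-retrain idea named in the title: drop small-magnitude connections, then retrain while re-applying the mask so pruned weights stay zero (the 90% pruning ratio, toy layer, and random data are my own choices):

```python
import torch

layer = torch.nn.Linear(512, 512)

with torch.no_grad():
    threshold = layer.weight.abs().quantile(0.9)     # prune 90% of connections
    mask = (layer.weight.abs() > threshold).float()
    layer.weight.mul_(mask)

optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)
x, y = torch.randn(32, 512), torch.randn(32, 512)     # stand-in data
for _ in range(10):                                   # retrain the survivors
    loss = torch.nn.functional.mse_loss(layer(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    with torch.no_grad():
        layer.weight.mul_(mask)                       # keep pruned weights at zero
```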

ArXiv’16: SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size.

  • Devices: not specified in the paper (general platform?).
  • Code on GitHub

ISVLSI’16: Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware.

  • Model: face detection
  • Devices: FPGA (power & performance evaluation)
    • Xilinx XC7Z045 (16-bit data bitwidth)
    • Xilinx XC7Z020 (8-bit data bitwidth)
    • For comparison: N

ICLR Workshop’16: Hardware-friendly Convolutional Neural Network with Even-Number Filter Size (4 pages, with Tsinghua Brain).

  • MXNet (Chen et al., 2015)
  • Devices:
    • Intel Xeon E5-2690 CPUs @ 2.9 GHz,
    • 2 Nvidia TITAN X GPUs.

More

Created Oct 20, 2020 // Last Updated Oct 22, 2020
