Efficient AI with Tiny Resources

References:

Evaluations

NeurIPS’20: MCUNet: Tiny Deep Learning on IoT Devices

  • Datasets: ImageNet, Visual Wake Words (VWW), Speech Commands.
  • Devices: STM32F746 MCU (320 kB SRAM / 1 MB Flash, 216 MHz) and STM32H743 MCU (512 kB SRAM / 2 MB Flash); see the memory-budget sketch below.
  • YouTube Demo
  • Record-high 70.2% ImageNet accuracy on microcontrollers.
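
A back-of-the-envelope way to read those memory limits: an int8-quantized model has to fit its weights in flash and its peak activation map in SRAM. The sketch below is my own illustration of that budget check in Python (the parameter and activation counts are made-up numbers, not MCUNet's):

```python
# Rough memory-budget check for MCU deployment (illustrative only; the
# constants match the STM32F746 figures above, the model numbers are
# hypothetical). An int8 model must fit its weights in flash and its
# peak activation map in SRAM.
FLASH_BYTES = 1 * 1024 * 1024   # 1 MB flash for weights
SRAM_BYTES = 320 * 1024         # 320 kB SRAM for activations

def fits_mcu(num_params, peak_activation_elems, bytes_per_value=1):
    """Return True if an int8-quantized model fits the MCU budget."""
    weight_bytes = num_params * bytes_per_value
    act_bytes = peak_activation_elems * bytes_per_value
    return weight_bytes <= FLASH_BYTES and act_bytes <= SRAM_BYTES

# Hypothetical model: ~0.7 M parameters, 250k-element peak activation map.
print(fits_mcu(num_params=700_000, peak_activation_elems=250_000))  # True
```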

NeurIPS’20: Tiny Transfer Learning: Towards Memory-Efficient On-Device Learning

  • Three benchmark datasets: Cars, Flowers, Aircraft

    • Using ImageNet as the pre-training dataset.
    • Neural network architecture: MobileNetV2 (lightweight), ResNet-50.
  • Devices: Raspberry Pi 1 with 256 MB of memory (a bias-only fine-tuning sketch follows below).
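
One way to picture the memory saving behind this paper (my own reading, not the authors' code): freeze the pretrained weights and update only biases and the classifier head, so intermediate activations need not be stored for weight gradients. A minimal PyTorch sketch, assuming torchvision's MobileNetV2 and a hypothetical 102-class target task:

```python
import torch
import torchvision

# In practice the backbone would be loaded with ImageNet-pretrained weights.
model = torchvision.models.mobilenet_v2()
model.classifier[1] = torch.nn.Linear(model.last_channel, 102)  # new task head

for name, p in model.named_parameters():
    # Train only biases and the new classifier head; freeze all weights.
    p.requires_grad = name.endswith(".bias") or name.startswith("classifier")

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.01, momentum=0.9
)
```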

NeurIPS’20: Differentiable Augmentation for Data-Efficient GAN Training

  • Model: BigGAN, CR-BigGAN, StyleGAN2
  • Datasets: ImageNet (128x128 resolution), FFHQ portrait dataset (256x256), and low-shot (few-shot) generation datasets.
  • Devices: not specified in the paper (general GPU platform?); a minimal sketch of the augmentation idea follows below.
  • Code: Data-efficient-gans
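
The core trick, as I understand it, is to apply the same differentiable transform to both real and generated images in both the discriminator and generator updates, so gradients flow back to the generator through the augmentation. A minimal sketch with a toy brightness-shift policy and hinge losses (the paper's actual policy combines color, translation, and cutout):

```python
import torch

def diff_augment(x):
    # Toy policy: per-sample random brightness shift, differentiable w.r.t. x.
    return x + (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5)

def d_loss(D, G, real, z):
    # Augment both real and fake images before the discriminator (hinge loss).
    fake = G(z).detach()
    return (torch.relu(1 - D(diff_augment(real))).mean()
            + torch.relu(1 + D(diff_augment(fake))).mean())

def g_loss(D, G, z):
    # The same augmentation is kept in the generator update.
    return -D(diff_augment(G(z))).mean()
```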

ICLR’20: Once-for-All: Train One Network and Specialize It for Efficient Deployment.

  • Dataset: ImageNet.
  • Devices: Samsung S7 Edge, Note10, Google Pixel 1, Pixel 2, LG G8 phones; NVIDIA 1080 Ti and V100 GPUs; Jetson TX2; Intel Xeon CPU; Xilinx ZU9EG and ZU3EG FPGAs.
  • Cloud Devices:
    • GPU: NVIDIA 1080 Ti and V100 with PyTorch 1.0 + cuDNN (see the latency-measurement sketch after this list).
    • CPU: Intel Xeon E5-2690 v4 with MKL-DNN, batch size 1.
  • Edge Devices:
    • Mobile phones: Samsung, Google and LG phones with TF-Lite, batch size 1;
    • Mobile GPU: Jetson TX2 with PyTorch 1.0 + cuDNN, batch size 16;
    • Embedded FPGA: Xilinx ZU9EG and ZU3EG FPGAs with Vitis AI, batch size 1 (inference acceleration).
    • Xilinx ZU9EG (ZCU102, $2495)
    • Xilinx ZU3EG
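
For the cloud-GPU rows above, a rough batch-size-1 latency measurement in PyTorch might look like the sketch below (my own harness with a stand-in torchvision model, not the OFA measurement code; it needs a CUDA-capable GPU):

```python
import time
import torch
import torchvision

model = torchvision.models.mobilenet_v2().cuda().eval()
x = torch.randn(1, 3, 224, 224, device="cuda")   # batch size 1

with torch.no_grad():
    for _ in range(50):                # warm-up so cuDNN autotuning settles
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(100):
        model(x)
    torch.cuda.synchronize()

print(f"avg latency: {(time.time() - start) / 100 * 1000:.2f} ms")
```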

ECCV’20: Searching Efficient 3D Architectures with Sparse Point-Voxel Convolution.

ECCV’20: DataMix: Efficient Privacy-Preserving Edge-Cloud Inference.

ACL’20: HAT: Hardware-Aware Transformers for Efficient Natural Language Processing.

CVPR’20: GAN Compression: Efficient Architectures for Interactive Conditional GANs

  • Three GAN (Generative Adversarial Network) Models
    • CycleGAN
    • Pix2pix
    • GauGAN
  • Four datasets:

    • Edges->shoes. Used in pix2pix.
    • Cityscapes. Used in pix2pix and GauGAN.
    • Horse <-> zebra. Used in CycleGAN.
    • Map <-> aerial photo. Used in pix2pix.
  • Devices: CPU or NVIDIA GPU with CUDA + cuDNN.

CVPR’20: APQ: Joint Search for Network Architecture, Pruning and Quantization Policy.

DAC’20: GCN-RL Circuit Designer: Transferable Transistor Sizing with Graph Neural Networks and Reinforcement Learning.

HPCA’20: SpArch: Efficient Architecture for Sparse Matrix Multiplication.

ICLR’20: Lite Transformer with Long Short Term Attention.

NeurIPS’19: Point Voxel CNN for Efficient 3D Deep Learning.

NeurIPS’19: Deep Leakage from Gradients.

ICCV’19: TSM: Temporal Shift Module for Efficient Video Understanding.

CVPR’19: HAQ: Hardware-Aware Automated Quantization with Mixed Precision.

ICLR’19: ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware.

ICLR’19: Defensive Quantization: When Efficiency Meets Robustness.

ECCV’18: AMC: AutoML for Model Compression and Acceleration on Mobile Devices.


ICLR’18: Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training.

ICLR’18: Efficient Sparse-winograd Convolutional Neural Networks.

ICLR’17: DSD: Dense-Sparse-Dense Training for Deep Neural Networks.

ICLR’17: Trained Ternary Quantization.

IEEE Micro’17 (Hot Chips): Software-Hardware Co-Design for Efficient Neural Network Acceleration.

  • Devices: FPGA
    • Xilinx XCKU060, 200MHz
    • CPU: Intel Core i7-5930K
    • GPU: Pascal Titan X.

EMDNN’16 / FPGA’17: ESE: Efficient Speech Recognition Engine with Sparse LSTM on FPGA.

  • Devices: FPGA
    • Xilinx XCKU060, 200MHz
    • Intel Core i7 5930K
    • Pascal Titan X GPU

O’Reilly, 2016: Compressing and Regularizing Deep Neural Networks: Improving Prediction Accuracy Using Deep Compression and DSD Training.


ISCA’16: EIE: Efficient Inference Engine on Compressed Deep Neural Network.

  • Models: AlexNet and VGGNet (ImageNet), NeuralTalk; Caffe framework. (See the sparse matrix-vector sketch after this list.)
  • Devices:
    • Cycle-accurate simulator implemented in C++.
    • Also implemented in RTL (Verilog) to measure area, power, and critical path delay.
    • Synopsys Design Compiler (DC) used under the TSMC 45nm GP standard VT library.
    • Cacti used to get SRAM area and energy numbers.
    • Baselines: CPU, GPU, and mobile GPU.
    • Intel Core i7-5930K CPU
    • Nvidia GeForce GTX Titan X GPU
    • Nvidia Tegra K1 mobile GPU (192 CUDA cores)
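
The computation such an engine accelerates is a sparse matrix-vector product over compressed weights that also skips zero activations. A plain-Python illustration of that inner loop (not the hardware or the paper's code):

```python
def sparse_matvec(values, col_idx, row_ptr, x):
    """CSR-encoded weights (values / col_idx / row_ptr) times activation x."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            a = x[col_idx[k]]
            if a != 0.0:               # skip zero activations (dynamic sparsity)
                y[row] += values[k] * a
    return y

# Tiny example: [[0, 2, 0], [1, 0, 3]] times x = [4, 0, 5] -> [0, 19].
print(sparse_matvec([2.0, 1.0, 3.0], [1, 0, 2], [0, 1, 3], [4.0, 0.0, 5.0]))
```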

ICLR’16: Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding.

  • Model/Platform: Caffe framework;
    • LeNet-300-100, LeNet-5 on MNIST
    • AlexNet, VGG-16 on ImageNet
  • Devices: CPU, GPU, mobile GPU.
    • Nvidia GeForce GTX Titan X
    • Intel Core i7 5930K as desktop processor.
    • Nvidia Tegra K1 as mobile processor.
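
As a reminder of what the pipeline in the title does after pruning, here is a small NumPy sketch of the weight-sharing step: cluster the surviving weights of a layer into 2^b shared values with a toy k-means (simplified from the paper's per-layer codebooks; the pruning threshold below is arbitrary):

```python
import numpy as np

def share_weights(w, bits=4, iters=10):
    """Cluster nonzero weights into 2**bits shared centroids (toy k-means)."""
    nz = w[w != 0]
    centroids = np.linspace(nz.min(), nz.max(), 2 ** bits)      # linear init
    for _ in range(iters):
        assign = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
        for c in range(len(centroids)):
            if np.any(assign == c):
                centroids[c] = nz[assign == c].mean()
    q = w.copy()
    assign = np.abs(nz[:, None] - centroids[None, :]).argmin(axis=1)
    q[w != 0] = centroids[assign]      # zeros stay pruned; rest share values
    return q, centroids

w = np.random.randn(256, 256)
w[np.abs(w) < 0.5] = 0.0                # magnitude pruning (arbitrary threshold)
q, codebook = share_weights(w, bits=4)  # 16 shared values + a sparse mask
```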

NIPS’15: Learning both Weights and Connections for Efficient Neural Networks.

  • Model/Platform: Caffe; Four networks:
    • LeNet-300-100, LeNet-5 on MNIST
    • AlexNet and VGG on ImageNet.
  • Devices:
    • Nvidia Titan X
    • GTX 980 GPUs.
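
A minimal PyTorch sketch of the prune-and-retrain idea named in the title: drop small-magnitude connections, then retrain while re-applying the mask so pruned weights stay zero (the 90% pruning ratio, toy layer, and random data are my own choices):

```python
import torch

layer = torch.nn.Linear(512, 512)

with torch.no_grad():
    threshold = layer.weight.abs().quantile(0.9)     # prune 90% of connections
    mask = (layer.weight.abs() > threshold).float()
    layer.weight.mul_(mask)

optimizer = torch.optim.SGD(layer.parameters(), lr=0.01)
x, y = torch.randn(32, 512), torch.randn(32, 512)     # stand-in data
for _ in range(10):                                   # retrain the survivors
    loss = torch.nn.functional.mse_loss(layer(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    with torch.no_grad():
        layer.weight.mul_(mask)                       # keep pruned weights at zero
```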

ArXiv’16: SqueezeNet: AlexNet-Level Accuracy with 50x Fewer Parameters and <0.5MB Model Size.

  • Devices: not specified in the paper (general platform?).
  • Code on GitHub

ISVLSI’16: Angel-Eye: A Complete Design Flow for Mapping CNN onto Customized Hardware.

  • Model: face detection
  • Devices: FPGA (power & performance evaluation)
    • Xilinx XC7Z045 (16-bit data bitwidth)
    • Xilinx XC7Z020 (8-bit data bitwidth)
    • For comparison: N

ICLR Workshop’16: Hardware-friendly Convolutional Neural Network with Even-Number Filter Size (4 pages, with Tsinghua Brain).

  • MXNet (Chen et al., 2015)
  • Devices:
    • Intel Xeon E5-2690 CPUs @ 2.9 GHz,
    • 2 Nvidia TITAN X GPUs.

More

Created Oct 20, 2020 // Last Updated Oct 22, 2020
