Deep Learning Inferences on Embedded Platforms

Slot: AC Tue and Fri 2-3:30 pm

Location: Bharti 106

Evaluation plan: 10% Minor 1, 5% one-to-one session, 15% presentation, 35% course project, 35% end-semester exam

[1] Background

If all mobile devices are connected to the Internet, why not run all heavy computations on the cloud anyway? What is the point of running deep learning inferences on mobile or embedded platforms? We will discuss some motivating examples where network connectivity cost, latency, or energy can make local computation on the mobile device more useful than remote execution on the cloud. We will also spend some time understanding what exactly runs on the mobile in typical usage scenarios (inference tasks using pre-trained models), and the deepnet layer details for some such typical computations.
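To make the typical on-device computation concrete, here is a minimal NumPy sketch of a two-layer fully connected forward pass. The weight shapes and random values are illustrative placeholders standing in for a pre-trained model; they are not taken from any of the cited works.

    # Minimal sketch of an inference pass: the weights below are random
    # placeholders standing in for a pre-trained model (hypothetical shapes).
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.standard_normal((256, 784)), np.zeros(256)   # hidden layer
    W2, b2 = rng.standard_normal((10, 256)), np.zeros(10)     # classifier layer

    def infer(x):
        """One inference on a flattened 28x28 input image."""
        h = np.maximum(W1 @ x + b1, 0.0)   # fully connected layer + ReLU
        logits = W2 @ h + b2               # classifier layer
        return int(np.argmax(logits))      # predicted class index

    print(infer(rng.standard_normal(784)))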

[2] Metrics and trade-offs

While the accuracy of the inference task is an important metric to maximize, it can trade off against other metrics on resource-constrained embedded platforms. Is the latency of each inference too high for a real-time mobile application a user is interacting with, or for a road traffic application that detects or prevents accidents? Is the trained deep-net model used in the inference too large to fit in the embedded platform's RAM? Does the inference task drain the mobile battery too fast? We will discuss metrics like accuracy, latency, memory, and power requirements, and the trade-offs among them. The goal of the course is to see how different research communities are innovating to better handle these trade-offs. [tradeoff] [deepiot]
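As a simplified illustration of two of these metrics, the sketch below times the per-inference latency of a toy one-layer model and reports its weight memory footprint. The layer size and the infer function are hypothetical; on a real device one would profile the actual framework and also measure power.

    # Minimal sketch of measuring per-inference latency and model memory
    # footprint for a toy (hypothetical) one-layer model.
    import time
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 784)).astype(np.float32)  # toy layer weights

    def infer(x):
        return np.maximum(W @ x, 0.0)  # one fully connected layer + ReLU

    def measure_latency(infer_fn, x, runs=100):
        """Average wall-clock latency of one inference, in milliseconds."""
        start = time.perf_counter()
        for _ in range(runs):
            infer_fn(x)
        return 1000.0 * (time.perf_counter() - start) / runs

    x = rng.standard_normal(784).astype(np.float32)
    print("latency (ms):", measure_latency(infer, x))
    print("model size (KB):", W.nbytes / 1024)  # RAM needed for the weights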

[3] New Hardware Architectures

If the proliferation of large datasets enabled deepnets to learn from examples, hardware advances like GPUs have been an equally important enabling factor. We will discuss how computer architecture researchers are devising new architectural designs for embedded deepnets; this changes the hardware platform on which the inference tasks execute. Three main concepts will be discussed in this section: (i) how to efficiently store and access the sparse matrices of a DNN in memory, (ii) how to split hardware resources like compute and memory elements into small units, or Processing Engines (PEs), that can process parts of a DNN in parallel, and (iii) how to design dataflows, i.e. the order in which processing is done, to maximize data reuse across the memory hierarchy (off-chip DRAM, on-chip SRAM, PE interconnects, registers, ...) and minimize latency and energy. [eyeriss] [eie] [scnn] [survey]
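As a toy software illustration of idea (i), the following sketch stores a small sparse weight matrix in compressed sparse row (CSR) form and multiplies it with a vector while touching only the stored non-zeros. The matrix and helper names are made up for illustration; real accelerators use their own hardware-specific sparse encodings.

    # Minimal sketch of CSR storage for a sparse weight matrix: only the
    # non-zero values and their positions are kept in memory.
    import numpy as np

    def to_csr(W):
        """Return (values, column indices, row pointers) for a dense matrix W."""
        values, col_idx, row_ptr = [], [], [0]
        for row in W:
            nz = np.nonzero(row)[0]
            values.extend(row[nz])
            col_idx.extend(nz)
            row_ptr.append(len(values))
        return np.array(values), np.array(col_idx), np.array(row_ptr)

    def csr_matvec(values, col_idx, row_ptr, x):
        """Sparse matrix-vector product, touching only the stored non-zeros."""
        y = np.zeros(len(row_ptr) - 1)
        for i in range(len(y)):
            lo, hi = row_ptr[i], row_ptr[i + 1]
            y[i] = values[lo:hi] @ x[col_idx[lo:hi]]
        return y

    W = np.array([[0., 2., 0.], [0., 0., 0.], [3., 0., 1.]])  # toy sparse weights
    v, c, r = to_csr(W)
    print(csr_matvec(v, c, r, np.array([1., 1., 1.])))  # matches W @ [1, 1, 1]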

[4] Systems

Mobile systems researchers create a software interface between the architecture researchers who design the actual hardware on which inference tasks run, and the ML researchers who design the computations each inference task needs. We will discuss traditional systems optimization techniques in the context of embedded deepnets, such as scheduling (e.g. (i) pipelining different inference tasks to reduce latency, (ii) spreading computations across the CPU, GPU, and other co-processors on the mobile platform, and (iii) offloading to the cloud when the network is available) and caching (e.g. storing reusable results to reduce computation). [deepmon] [deepeye] [deepx] [mcdnn] [leo]
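The caching idea can be illustrated with a small software sketch: an LRU cache keyed on a hash of the input that skips the DNN computation when an input repeats. The class name and the exact-match keying are simplifications assumed here; the cited systems use more sophisticated similarity-based result reuse.

    # Minimal sketch of result caching: repeated inputs skip recomputation.
    import hashlib
    from collections import OrderedDict

    class InferenceCache:
        def __init__(self, infer_fn, capacity=128):
            self.infer_fn = infer_fn
            self.capacity = capacity
            self.cache = OrderedDict()          # LRU: oldest entry evicted first

        def __call__(self, frame_bytes):
            key = hashlib.sha1(frame_bytes).hexdigest()
            if key in self.cache:
                self.cache.move_to_end(key)     # cache hit: no DNN computation
                return self.cache[key]
            result = self.infer_fn(frame_bytes) # cache miss: run the full inference
            self.cache[key] = result
            if len(self.cache) > self.capacity:
                self.cache.popitem(last=False)
            return result

    # Usage with a dummy "model" that just reports the input length.
    cached = InferenceCache(lambda b: ("label", len(b)))
    print(cached(b"frame-1"), cached(b"frame-1"))  # second call is a cache hit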

[5] Neural Network Compression

Almost all embedded deep-net practitioners exploit the redundancies present in a trained DNN to sparsify and compress it. This reduces the size of the DNN, and therefore the storage requirements of the model on disk and in RAM during embedded deployment. Computation may also be reduced, and with careful compression, latency and energy come down too. The challenge is to do this while maintaining reasonable inference accuracy and a low DNN re-training overhead. The main ideas to be discussed are (i) matrix and tensor factorization methods that make DNN weight matrices sparse while retaining the most informative elements, (ii) pruning DNN connections, either those with low-magnitude weights or those whose removal empirically minimizes latency and energy, and (iii) different quantization mechanisms, where the bit representation of the weights is reduced by using a different numerical representation (floating point, fixed point, binary, ternary), the number of unique weight values is reduced (using K-Means clustering or hashing), or weights are encoded more efficiently (Huffman encoding). [sparsification] [pruning-magnitude] [pruning-architecture] [pruning-systematic] [pruning-energy] [quantization-fixedpoint-minifloat] [quantization-8bitINT] [clustering-huffman]
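As a simplified illustration of ideas (ii) and (iii), the sketch below applies magnitude-based pruning followed by symmetric 8-bit linear quantization to a toy weight matrix. The sparsity level and helper names are illustrative assumptions, and a real pipeline would also re-train the pruned network to recover accuracy.

    # Minimal sketch of magnitude pruning + symmetric int8 quantization.
    import numpy as np

    def prune_by_magnitude(W, sparsity=0.8):
        """Zero out the `sparsity` fraction of weights with the smallest magnitude."""
        threshold = np.quantile(np.abs(W), sparsity)
        return np.where(np.abs(W) >= threshold, W, 0.0)

    def quantize_int8(W):
        """Linear quantization of weights to int8, plus the scale to dequantize."""
        scale = np.max(np.abs(W)) / 127.0
        q = np.clip(np.round(W / scale), -127, 127).astype(np.int8)
        return q, scale

    W = np.random.default_rng(0).standard_normal((4, 4))
    W_pruned = prune_by_magnitude(W)
    q, scale = quantize_int8(W_pruned)
    print("non-zeros kept:", np.count_nonzero(W_pruned), "of", W.size)
    print("max dequantization error:", np.max(np.abs(q * scale - W_pruned)))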

[6] Learning Smaller Networks

Machine learning researchers design the actual computations needed in an inference task. There are significant efforts towards alternate network architectures that are more efficient by design and use a lot of domain knowledge to work well for specific embedded tasks. [deeprebirth] [shufflenet] [mobilenet] [squeezenet] [deepear] [protonn] [bonsai]
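A back-of-the-envelope sketch of why such architectures are smaller by design: the parameter count of a standard 3x3 convolution versus a MobileNet-style depthwise separable convolution, for illustrative channel sizes chosen here.

    # Parameter counts: standard conv vs depthwise separable conv (toy sizes).
    def standard_conv_params(c_in, c_out, k=3):
        return k * k * c_in * c_out

    def depthwise_separable_params(c_in, c_out, k=3):
        depthwise = k * k * c_in          # one k x k filter per input channel
        pointwise = c_in * c_out          # 1x1 convolution mixing channels
        return depthwise + pointwise

    c_in, c_out = 128, 256
    std = standard_conv_params(c_in, c_out)
    sep = depthwise_separable_params(c_in, c_out)
    print(std, sep, round(std / sep, 1))  # roughly an 8-9x parameter reduction here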
Topic | Slides
Course motivation and overview | [lecture-1]
Deep learning inference background | [lecture-2], [lecture-3], [lecture-4], [lecture-5]
Metrics and trade-offs | [lecture-6]
New Hardware Architectures | [lecture-7], [lecture-8], [lecture-9], [lecture-10]
System Optimizations | [lecture-11]
Neural Network Compression | [lecture-12], [lecture-13], [lecture-14]
Learning Smaller Networks | [lecture-15]
Course summary |

Minor 1

Run the demo applications of some existing mobile DNN frameworks on an Android device. Submit a report with the hardware details of the Android device, the issues faced in running each framework, and their fixes (if any). Possible mobile deep learning frameworks:

Course project

Each student will design, implement, and evaluate an embedded DNN system based on their research interests. The deliverables are a demo of the working system, a GitHub repo with all the sources, and a report. The due date is May 10.