Slot: AC Tue and Fri 2-3:30 pm
Location: Bharti 106
Evaluation plan: 10% minor 1, 5% one-to-one session, 15% presentation, 35% course project, 35% end-semester exam
If all mobiles are connected to the Internet, why not run all heavy computations on the cloud anyway? What is the point of running deep learning inference on mobile or embedded platforms? We will discuss some motivating examples where network connectivity cost, latency, or energy can make local computation on a mobile device more useful than remote execution on the cloud. We will also spend some time understanding what exactly runs on the mobile in typical usage scenarios (inference tasks using pre-trained models), and the deep-net layer details for some such typical computations.
While accuracy of the inference task is an important metric to maximize, it can trade off against other metrics on resource-constrained embedded platforms. Is the latency of each inference too high for a real-time mobile application a user is interacting with, or for a road traffic application that detects or prevents accidents? Is the trained deep-net model used in the inference too large to fit in the embedded platform's RAM? Does the inference task drain the mobile battery too fast? We will discuss metrics such as accuracy, latency, memory, and power requirements, and the trade-offs among them. The goal of the course is to see how different research communities are innovating to better handle these trade-offs. [tradeoff] [deepiot]
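Two of these metrics, latency and model memory, are straightforward to measure on-device. A minimal sketch below uses a toy one-layer "model" (sizes and names are illustrative, not from the course) to show the idea: memory is the byte size of the stored weights, and latency is averaged wall-clock time over repeated inferences.

```python
import time
import numpy as np

# Toy "model": a single dense layer standing in for a pre-trained DNN.
rng = np.random.default_rng(0)
W = rng.standard_normal((1024, 1024)).astype(np.float32)
b = np.zeros(1024, dtype=np.float32)

def infer(x):
    # Dense layer followed by ReLU.
    return np.maximum(W @ x + b, 0.0)

x = rng.standard_normal(1024).astype(np.float32)

# Memory metric: parameter storage (float32 = 4 bytes per weight).
model_bytes = W.nbytes + b.nbytes

# Latency metric: average wall-clock time per inference.
n = 100
start = time.perf_counter()
for _ in range(n):
    infer(x)
latency_ms = (time.perf_counter() - start) / n * 1e3

print(f"model size: {model_bytes / 1e6:.2f} MB")
print(f"avg latency: {latency_ms:.3f} ms")
```

On a real framework the same two numbers come from the serialized model file size and from timing the framework's inference call; energy typically needs external instrumentation.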
If the proliferation of large datasets enabled deep nets to learn from examples, hardware advances such as GPUs have been an equally important enabling factor. We will discuss how computer architecture researchers are devising new architectural designs for embedded deep nets, changing the hardware platform on which the inference tasks execute. Three main concepts to be discussed in this section are (i) how to efficiently store and access the sparse matrices of a DNN in memory, (ii) how to split hardware resources like compute and memory elements into small units or Processing Engines (PEs) that can process parts of a DNN in parallel, and (iii) how to design dataflows, i.e., the order in which processing is done, to maximize data reuse and exploit the memory hierarchy (off-chip DRAM, on-chip SRAM, PE interconnects, registers, ...) for minimum latency and energy. [eyeriss][eie] [scnn] [survey]
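Concept (i) can be illustrated with a compressed sparse row (CSR) encoding, which stores only the nonzero weights plus index bookkeeping. This is a minimal Python sketch of the general idea, not the exact compressed format used in the EIE or SCNN accelerators.

```python
import numpy as np

# Dense weight matrix with many zeros (e.g. after pruning).
W = np.array([[0, 2, 0, 0],
              [1, 0, 0, 3],
              [0, 0, 0, 0],
              [0, 4, 5, 0]], dtype=np.float32)

# CSR encoding: nonzero values, their column indices, and
# row_ptr[i]:row_ptr[i+1] delimiting row i's entries in `values`.
values, col_idx, row_ptr = [], [], [0]
for row in W:
    for j, v in enumerate(row):
        if v != 0:
            values.append(v)
            col_idx.append(j)
    row_ptr.append(len(values))

def csr_matvec(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product touching only stored nonzeros."""
    y = np.zeros(len(row_ptr) - 1, dtype=np.float32)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y

x = np.ones(4, dtype=np.float32)
assert np.allclose(csr_matvec(values, col_idx, row_ptr, x), W @ x)
```

With 5 nonzeros out of 16 entries, the encoding stores 5 values plus 10 small indices instead of 16 floats, and the matrix-vector product skips all multiplications by zero; hardware designs exploit the same two savings in memory traffic and compute.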
Topic | Slides |
---|---|
Course motivation and overview | [lecture-1] |
Deep learning inference background | [lecture-2], [lecture-3], [lecture-4], [lecture-5] |
Metrics and trade-offs | [lecture-6] |
New Hardware Architectures | [lecture-7], [lecture-8], [lecture-9], [lecture-10] |
System Optimizations | [lecture-11] |
Neural Network Compression | [lecture-12], [lecture-13], [lecture-14] |
Learning Smaller Networks | [lecture-15] |
Course summary | |
Run the demo applications of some existing mobile DNN frameworks on an Android device. Submit a report with the hardware details of the Android device, any issues encountered in running each framework, and their fixes (if any). Possible mobile deep learning frameworks:
Each student designs, implements, and evaluates an embedded DNN system based on their research interests. The deliverables are a demo of the working system, a GitHub repo with all sources, and a report. The due date is May 10.