Classes in AA slot. Mondays and Thursdays 2-3:30 pm. LHC519.

Course Content

NOTE: Topics below are tentative (until we are past that week). We will update them as we go through the lectures each week.
Lecture 1: ML inferences overview MLinferences_overview.pdf

Lecture 2: ML Model Architecture changes to reduce computations and storage: (i) Original computations and storage needed in basic CNN forward pass cnn-basics.pdf, (ii) 1x1 bottleneck layer inception-1x1-bottlenecklayer.pdf, Andrew NG video, (iii) depthwise separable convolutions mobilenet-depthwise-separable-convolution.pdf
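The compute savings from these factorizations can be sketched with a back-of-the-envelope MAC count (function names and the layer shape below are illustrative, not taken from the slides):

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    # every output element needs k*k*c_in multiply-accumulates
    return h * w * c_out * k * k * c_in

def depthwise_separable_macs(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 convolution mixes channels
    return depthwise + pointwise

# a MobileNet-sized layer: 112x112 feature map, 64 -> 128 channels, 3x3 kernel
std = standard_conv_macs(112, 112, 64, 128, 3)
sep = depthwise_separable_macs(112, 112, 64, 128, 3)
print(f"standard {std:,} vs separable {sep:,} MACs ({std / sep:.1f}x fewer)")
```

For these shapes the savings are roughly 1/(1/c_out + 1/k^2), i.e. about 8-9x, which matches the MobileNet paper's analysis.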

Lecture 3: ML model compression techniques: (i) Quantization quantization.pdf, (ii) Pruning pruning.pdf
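A minimal sketch of the two techniques, assuming simple symmetric per-tensor int8 quantization and unstructured magnitude pruning (the slides and tool links below cover many refinements beyond this):

```python
import numpy as np

def quantize_int8(w):
    scale = np.abs(w).max() / 127.0          # map [-max, max] onto [-127, 127]
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

def magnitude_prune(w, sparsity=0.5):
    thresh = np.quantile(np.abs(w), sparsity)  # zero out the smallest weights
    return np.where(np.abs(w) < thresh, 0.0, w)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)

q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()     # bounded by half a quantization step
pruned = magnitude_prune(w, sparsity=0.5)
```

Quantization trades a bounded per-weight error for 4x smaller storage (int8 vs fp32); pruning stores only the surviving weights, at the cost of sparse-matrix bookkeeping.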

Lectures 4-5: The difference between Gen1 and Gen2 NN compilers NNcompilers.pdf, and a Gen2 NN compiler which optimizes computations to utilize target hardware capabilities ANSOR_slides.pdf

Lectures 6-7: Runtime management in an inference serving system, like scheduling and load balancing NNruntime.pdf
Additional online resources on ML model compression: tensorflow, tensorflowlite, ristretto, pytorch, STM32, nvidia tensorrt, tensorrt sample script
Necessary readings for quiz1:
(i) Mobilenet paper, (ii) Scalpel paper, (iii) Compiler-based optimizations of inference computations for target hardware with Ansor, (iv) Runtime management of inference servers with Nexus
Aug 18

Transformer and LLM background: slides_aug25.pdf, jupyter_aug25.ipynb based on the online tutorial (prelude and part 3 on transformers) https://medium.com/@bradneysmith/98a1320e7650. Other online resources: next-word prediction task video1, video2, video3, video4
karpathy_video1, with git repo, karpathy_video2
Translation with a transformer model umarjamil_video1, coding a transformer from scratch umarjamil_video2, Llama architecture and features umarjamil_video3, coding Llama from scratch umarjamil_video4
Flash Attention Rewriting the attention kernels to use the Nvidia GPU hardware optimally. In FlashAttention-1, the smaller, faster SRAM is used as memory for the attention computation instead of the larger, slower HBM. To fit in SRAM, the computation is done in tiles. The main innovation is algebraic: showing that the tiled softmax is exactly the same as the non-tiled softmax. By fusing the (matmul-softmax-matmul) operations into a single kernel, the intermediate N*N outputs of the attention scores S and probabilities P are not written out to, or read back from, HBM. Small statistics are saved in the forward pass, enough for the gradients to be **recomputed** in the backward pass. FlashAttention-2 and -3 optimize how this tiled kernel is parallelized across thread blocks and warps, to better utilize newer hardware features like the TMA and tensor cores.
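The tiled-vs-non-tiled softmax identity can be checked numerically with a toy online softmax in NumPy (this shows the rescaling trick only; it is not the actual fused CUDA kernel):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tiled_softmax(x, tile=4):
    # Process x tile by tile, keeping a running max m and running sum l.
    # When a new tile raises the max, previously accumulated exponentials
    # are rescaled by exp(m_old - m_new) -- the key algebraic identity.
    m, l, out = -np.inf, 0.0, []
    for i in range(0, len(x), tile):
        xi = x[i:i + tile]
        m_new = max(m, xi.max())
        l = l * np.exp(m - m_new) + np.exp(xi - m_new).sum()
        out = [t * np.exp(m - m_new) for t in out] + [np.exp(xi - m_new)]
        m = m_new
    return np.concatenate(out) / l
```

Because each tile only needs the running (m, l) statistics, the kernel never has to materialize the full row of scores, which is what lets the computation stay in SRAM.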

Original papers by Tri Dao et al.: FlashAttention neurips22, FlashAttention-2 iclr24, FlashAttention-3 neurips24. Slides used in class slides1, slides2. Arunabh's notes on FlashAttention-1 notes1, Sarthak's notes on FlashAttention-2 notes2, notes3, additional FlashAttention-1 notes washington.

Background material on Hopper/Blackwell GPU architectural extensions used in Flashattention-3 cuda-programming-guide sections 10.26 (warp specialization for producer consumer) and 10.29 (Tensor Memory Accelerator). Jacobian calculation for softmax softmax.pdf.

POD Attention This also rewrites the attention kernels to better use the Nvidia GPU hardware. Each SM has a known number of hardware units to process load/store instructions (useful for the decode stage of LLM inference) and compute instructions (useful for the prefill stage). But the CTA and warp schedulers are black boxes inside the NVIDIA driver, not allowing the programmer to schedule prefill and decode across LLM inference requests so as to balance the use of hardware within an SM. POD-Attention is a beautiful paper that circumvents the black-box driver: it uses the same kernel for both prefill and decode, but decides within the kernel between prefill vs. decode work depending on that particular SM's hardware utilization. paper, slides
Youtube videos: Flashattention-1, Flashattention-2, Flashattention-3-jayshah, Flashattention-3-tridao, Podattention
Necessary readings for quiz2: (i) Chapters 3, 4, 5 from cudabook, (ii) How to optimally parallelize histogram, (iii) Flash Attention 1 paper, (iv) Flash Attention 2 paper, (v) Flash Attention 3 paper, (vi) podAttention paper, (vii) pagedAttention paper, (viii) vAttention paper. Sep 25 2:15-3:15 pm

Major exam syllabus

Pre-deployment LLM model compression
1. Sparsification slides
2. Quantization slides. Papers: Use the slides as pointers, to read specific parts of the papers. zeroquant.pdf, llmint8.pdf, gptq.pdf, smoothquant.pdf, duquant.pdf, clippedsoftmax.pdf
3. Delta compression for fine-tuned models
Use the slides as pointers, to read specific parts of the papers. Only the deltazip paper and code are included in the major syllabus.
a. bitdelta neurips24_bitdelta.pdf, bitdelta_code
b. deltacome neurips24_deltacome.pdf, deltacome_code
c. deltazip eurosys25_deltazip.pdf, deltazip_code
d. ultradelta neurips25_ultradelta.pdf, ultradelta_code
Runtime KV cache optimization using paging Using paging principles from the OS, the PagedAttention paper reduced internal and external fragmentation of KV cache memory usage. Their solution was implemented in userspace, which required the attention kernels to be re-written to handle non-contiguous virtual memory accesses. Follow-up work, the vAttention paper, used CUDA virtual memory management APIs so that virtual memory could remain contiguous and attention kernels could use paging much more easily. Slides pagedAttention_slides.pdf, vAttention_slides.pdf.
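A toy sketch of the page-table indirection idea, assuming a fixed block size and a free list (class and method names here are made up; the real block tables live inside the PagedAttention/vLLM kernels):

```python
class PagedKVCache:
    """Map each sequence's logical KV positions to fixed-size physical blocks."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free list of physical block ids
        self.tables = {}                     # seq_id -> list of physical blocks
        self.lengths = {}                    # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:         # current block full (or first token)
            table.append(self.free.pop())    # allocate exactly one new block
        self.lengths[seq_id] = n + 1

    def physical_slot(self, seq_id, pos):
        # translate logical position -> (physical block, offset), like a page table
        table = self.tables[seq_id]
        return table[pos // self.block_size], pos % self.block_size
```

Internal fragmentation is at most one partially-filled block per sequence, and freed blocks go back on the free list for any other sequence, eliminating external fragmentation.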

Not in the major syllabus. Another direction in KV cache optimization, especially to handle very long sequence lengths, has been to introduce structured sparsity. The FlashInfer systems paper handles these sparse attention kernel computations using block-sparse matrices.
Youtube videos: pagedAttention (up to the 24:33 mark).
Runtime resource provisioning and scheduling to handle millions of incoming client requests for LLMs at inference servers

1. Orca Fine-grained scheduling at the iteration level, instead of coarse-grained scheduling at the request level. Selective batching of requests for non-attention operations, while attention operations are not batched. Mixed batching of the prefill stages of new requests and the decode stages of old requests. 2022_osdi-orca.pdf

2. Sarathi-serve Fine-grained scheduling and mixed batching like Orca, but the prefill stage is capped at fixed chunk lengths of incoming request tokens. Ongoing decode stages of requests are prioritized in a batch, followed by ongoing prefills of requests. If more tokens can be accommodated within the batch's token budget, then prefills for new incoming requests are added to the batch. If the token budget is too large, long prefills hurt the TBT of decodes. If it is too small, the prefill of a single request is chunked into many batches, with the additional overhead of communicating KV caches between those batches. Paper: 2024_osdi-sarathiserve.pdf, Code: sarathi-serve
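The token-budget batching policy described above can be sketched as follows (a simplified illustration; the names and tie-breaking details are assumptions, not Sarathi-serve's actual code):

```python
def form_batch(decode_ids, prefills, token_budget):
    """Form one iteration's batch under a fixed token budget.

    decode_ids: ids of ongoing decodes (each contributes 1 token).
    prefills:   list of (req_id, remaining_prefill_tokens), in priority
                order -- ongoing prefills first, then new requests.
    """
    batch, budget = [], token_budget
    for rid in decode_ids:                   # decodes are prioritized
        if budget == 0:
            break
        batch.append((rid, 1))
        budget -= 1
    for rid, remaining in prefills:          # then chunked prefills
        if budget == 0:
            break
        chunk = min(remaining, budget)       # cap prefill at leftover budget
        batch.append((rid, chunk))
        budget -= chunk
    return batch
```

The budget is the tuning knob the paper analyzes: it bounds per-iteration work (and hence decode TBT) while amortizing prefill chunking overhead.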

3. Distserve, Splitwise The prefill stage is compute bound: batching even 1-2 requests can fully utilize GPU compute. The decode stage is memory bound: large batch sizes of 64-128 are needed to properly utilize GPU compute, after which memory becomes the bottleneck. As the two stages have very different characteristics, these two papers propose using different GPU servers in the cluster for the prefill stage and the decode stage. The new overhead is transferring KV caches from the prefill to the decode instances, which becomes feasible due to high-bandwidth InfiniBand connections between servers. The DistServe paper additionally examines layer parallelism vs. tensor parallelism (also called inter-op vs. intra-op parallelism), where tensor parallelism needs lower-latency communication between servers. The Splitwise paper additionally (i) analyzes the first trace of actual LLM requests arriving at Azure servers (part of which has been made public at AzurePublicDataset), (ii) presents an algorithm for cluster provisioning (how many machines in the cluster to use as prefill instances vs. how many as decode instances), using a simulator to get expected latency numbers, and (iii) discusses power requirements: decode instances can be run at stricter power caps or on an older architecture like Ampere, unlike prefill instances, which are compute heavy and therefore need higher power and/or a newer architecture like Hopper. Ideas from these papers are now part of Nvidia's official LLM inference server Dynamo. Paper: 2024_osdi-distserve.pdf, Code: distserve, Paper: 2024_isca-splitwise.pdf, Code: splitwise-sim
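A back-of-the-envelope arithmetic-intensity calculation illustrates why prefill is compute bound and decode is memory bound (all shapes here are illustrative; fp16 weights and activations are assumed, counting one weight matmul):

```python
def arithmetic_intensity(b, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte moved for a (b x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * b * d_in * d_out                          # multiply + add
    bytes_moved = (d_in * d_out + b * (d_in + d_out)) * bytes_per_elem
    return flops / bytes_moved

# decode: b = 1 token per request -> weights dominate traffic, ~1 FLOP/byte
decode = arithmetic_intensity(b=1, d_in=4096, d_out=4096)
# prefill: thousands of prompt tokens -> weight reads amortized, ~1000 FLOPs/byte
prefill = arithmetic_intensity(b=2048, d_in=4096, d_out=4096)
```

A modern GPU needs on the order of hundreds of FLOPs per byte of HBM traffic to stay compute bound, so decode sits far below the roofline knee while prefill sits far above it, which is exactly the asymmetry these papers exploit.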

4. ServerlessLLM, BlitzScale In the serverless paradigm, companies owning LLM models (either trained from scratch or fine-tuned) host inference services for clients on costly GPU hardware infrastructure. These companies should be able to scale the GPU hardware dynamically, based on the rise and fall in client requests. But the first client request routed to a GPU instance that does not have the LLM model loaded in GPU memory can suffer very high TTFT. These two papers make opposite assumptions: that communication between GPU servers is slow (ServerlessLLM) vs. fast (BlitzScale). ServerlessLLM loads the LLM model from the local host memory hierarchy, SSD to GPU RAM or DRAM to GPU RAM. It uses a novel model representation, which is faster to load in chunks compared to baseline methods of model loading. The paper also proposes live migration of requests to a new server, if the scheduler deems that more optimal for inference latencies. The scheduler itself is the third research contribution of the paper: it tracks which LLM model is in which tier of the memory hierarchy, estimates the inference latencies of local inference vs. live migration, and appropriately schedules model loading or routes requests to the optimal server. The other paper, BlitzScale, argues for network transfer of LLM models between servers, which it claims is faster, instead of loading the LLM model from the local memory hierarchy to the server's GPU RAM, which it claims is slower. One contribution of the paper is choosing, via a greedy heuristic, which new GPU instances will have minimal model communication time from a GPU instance that already has the LLM model. The second contribution is zigzag scheduling, where inference through the earlier layers of the LLM model can already start for new incoming requests, while the later layers of the model are still being loaded on a new GPU instance.
Paper: serverlessLLM.pdf, Slides: serverlessLLM_slides.pdf, Vaibhav's notes: serverlessLLM_notes.pdf, Code: serverlessLLM_code, Paper: blitzScale.pdf, Slides: blitzScale_slides.pdf, Code: blitzScale_code.
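The benefit of zigzag scheduling can be illustrated with toy timeline arithmetic (function names and numbers are illustrative assumptions, not from the paper):

```python
def ttft_zigzag(num_layers, load_t, compute_t):
    """Compute on layer i starts once layer i is loaded AND layer i-1 is done,
    so inference overlaps with the ongoing model transfer."""
    done = 0.0
    for i in range(num_layers):
        layer_loaded = (i + 1) * load_t      # layers arrive in order
        done = max(done, layer_loaded) + compute_t
    return done

def ttft_load_then_compute(num_layers, load_t, compute_t):
    """Baseline: wait for the whole model, then run the first forward pass."""
    return num_layers * load_t + num_layers * compute_t

# e.g. 32 layers, 10 ms transfer and 1 ms compute per layer:
# zigzag hides almost all compute behind the transfer of later layers.
zigzag = ttft_zigzag(32, load_t=10.0, compute_t=1.0)
baseline = ttft_load_then_compute(32, load_t=10.0, compute_t=1.0)
```

When transfer dominates (load_t > compute_t), the zigzag TTFT approaches the model transfer time plus a single layer's compute, rather than transfer plus a full forward pass.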

Evaluation

Assignments