Evaluation plan: Assignments 50%, Minor1 20%, Major 30%
Piazza link: col730
Topic | Slides | Supplementary Material
Introduction | course_content_aug4.pdf | Pacheco, GGKK Chapter 1
[Performance] The only reason for parallelism: quantifying performance improvements analytically | amdahl_gustafson_aug4.pdf (a worked Amdahl/Gustafson example follows this row) | amdahl_gustafson_alternateslides.pdf, amdahl_gustafson_problems.pdf, Pacheco Chapter 2.6, GGKK Chapter 5
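A minimal worked example of the two laws; the 90% parallel fraction and the core counts are illustrative assumptions, not numbers from the slides:

```c
#include <stdio.h>

/* Amdahl: fixed problem size. f = parallelizable fraction of the work. */
double amdahl(double f, int p)    { return 1.0 / ((1.0 - f) + f / p); }

/* Gustafson: problem size scales with p. s = serial fraction of the scaled run. */
double gustafson(double s, int p) { return p - s * (p - 1); }

int main(void) {
    double f = 0.9;             /* assume 90% of the work parallelizes */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p=%4d  Amdahl S=%6.2f  Gustafson S=%7.2f\n",
               p, amdahl(f, p), gustafson(1.0 - f, p));
    return 0;   /* Amdahl saturates near 1/(1-f) = 10; Gustafson keeps growing */
}
```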
[NVIDIA GPUs and CUDA] GPU architecture, CUDA programming and optimizations | i. Latency vs. throughput; hardware organization of a GPU (SMs, memory hierarchy); programming model (threads, blocks, grids; allocating memory and copying between host DRAM and GPU memory; kernel launch): cudabasics_aug8.pdf; small sample programs: helloworld.cu, cudamem.cu. ii. CUDA optimizations: CUDAoptimization1_aug18.pdf, CUDAoptimization2_aug22.pdf, parallelHistogram_aug22.pdf. iii. CUDA streams: cudastreams_aug25_aug29.pdf, parallelconvolution_aug29.pdf; sample stream program: simplestream.cu, common.h. (Sketches of a basic kernel launch and of stream overlap follow this row.) | Kirk and Hwu; nvidia1, nvidia2, nvidia3
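A minimal sketch of the allocate/copy/launch/copy-back cycle from (i); this is not the course's helloworld.cu or cudamem.cu, just an illustrative vector addition:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread adds one element; the grid of blocks covers the whole array. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;                                 /* allocate GPU DRAM */
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   /* host -> device */
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);          /* kernel launch */

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   /* device -> host */
    printf("c[12345] = %f\n", hc[12345]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```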
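A sketch of the copy/compute overlap idea behind (iii), assuming pinned host memory and two streams; the chunking scheme and the scale kernel are illustrative and not taken from simplestream.cu:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 22, nchunks = 4, chunk = n / nchunks;
    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));  /* pinned memory: needed for async copies */
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);

    for (int c = 0; c < nchunks; c++) {       /* a chunk's copy can overlap the other stream's kernel */
        cudaStream_t st = s[c % 2];
        float *hp = h + c * chunk, *dp = d + c * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();                  /* wait for both streams to drain */
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```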
[Multicore CPUs] Pthreads and OpenMP | i. Pthread and OpenMP programs for basic hello-world messages; for computing pi, where the final sum increment forms a critical section (protected with busy waiting and a mutex); and for message passing from a source to a destination thread, where the destination thread waits on a semaphore. OpenMP compilation: "gcc -g -Wall -fopenmp -o omp_hello omp_hello.c"; pthread compilation: "gcc -g -Wall -o pth_msgpassing pth_msgpassing.c -lpthread". Sample programs: pth_hello.c, omp_hello.c, pth_pi.c, pth_msgpassing.c. Slides: pthread_openmp_sep8.pdf. ii. Caches for faster memory access; cache coherence, necessary for correctness of multi-threaded programs on multi-core CPUs; false sharing of a cache line among threads, which hurts performance and can be handled with padding or thread-private data structures: caches_sep12.pdf, mesi_sep12.pdf, falsesharing_sep14.pdf. iii. How locks and barriers are implemented by operating systems using hardware instruction primitives: hwsynchronizationprimitives_sep19.pdf. iv. How processor cores are connected to each other and to memory: interconnection networks: interconnection_topology_sep14.pdf. v. Examples of concurrency bugs from real-world software: concurrency-bugs-asplospaper.pdf, concurrencybugs_sep22.pdf. (Sketches of the mutex-protected pi sum, false-sharing padding, and a test-and-set spinlock follow this row.) | Pacheco Chapters 4 and 5
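A minimal sketch of the mutex-protected pi computation described in (i); this is not the course's pth_pi.c, and the Leibniz series and thread count are illustrative choices. It compiles with the pthread command shown above:

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
static long long n = 100000000;        /* terms of the Leibniz series for pi/4 */
static double sum = 0.0;               /* shared; updating it is the critical section */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *thread_pi(void *arg) {
    long rank = (long)arg;
    double local = 0.0;
    for (long long i = rank; i < n; i += NTHREADS)
        local += ((i % 2 == 0) ? 1.0 : -1.0) / (2.0 * i + 1.0);
    pthread_mutex_lock(&lock);         /* protect the shared increment */
    sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long r = 0; r < NTHREADS; r++) pthread_create(&t[r], NULL, thread_pi, (void *)r);
    for (int r = 0; r < NTHREADS; r++)  pthread_join(t[r], NULL);
    printf("pi ~= %.10f\n", 4.0 * sum);
    return 0;
}
```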
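A sketch of the false-sharing fix from (ii): padding each thread's counter out to a cache line so the threads stop invalidating each other's copies under MESI. The 64-byte line size is an assumption, typical of x86; removing the pad makes the same program markedly slower:

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define ITERS 100000000L

/* Without pad, all four counters share one cache line and every increment
   invalidates the other cores' copies; pad gives each counter its own line. */
struct counter { long v; char pad[64 - sizeof(long)]; };
static struct counter counters[NTHREADS];

static void *work(void *arg) {
    long r = (long)arg;
    for (long i = 0; i < ITERS; i++) counters[r].v++;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long r = 0; r < NTHREADS; r++) pthread_create(&t[r], NULL, work, (void *)r);
    for (int r = 0; r < NTHREADS; r++)  pthread_join(t[r], NULL);
    long total = 0;
    for (int r = 0; r < NTHREADS; r++)  total += counters[r].v;
    printf("total = %ld\n", total);
    return 0;
}
```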
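A sketch of (iii): a busy-waiting lock built on an atomic test-and-set. The GCC/Clang __sync builtins stand in here for the underlying hardware instruction:

```c
#include <stdio.h>
#include <pthread.h>

static volatile int flag = 0;   /* 0 = free, 1 = held */
static long counter = 0;

/* Spin until the atomic exchange returns 0, i.e. we were the ones who set it. */
static void spin_lock(void)   { while (__sync_lock_test_and_set(&flag, 1)) ; }
static void spin_unlock(void) { __sync_lock_release(&flag); }

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock();
        counter++;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int r = 0; r < 4; r++) pthread_create(&t[r], NULL, work, NULL);
    for (int r = 0; r < 4; r++) pthread_join(t[r], NULL);
    printf("counter = %ld (expect 4000000)\n", counter);
    return 0;
}
```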
[Parallel programming on distributed memory] MPI, MapReduce | i. Point-to-point and group communication with MPI: mpi_sep29.pdf. ii. The MPI standard, which libraries such as OpenMPI, MPICH, and Intel MPI implement: mpi40-standard_oct6.pdf. iii. How companies like Uber and Baidu use MPI for large-scale computations: uber_horovod_oct6.pdf, horovod-mpi-oct6.pdf, baidu_allreduce_oct6.pdf. iv. The map-reduce programming model: mapreduce_rutgers_oct10.pdf, mapreducepaper_oct10.pdf. (Sketches of MPI point-to-point/collective calls, an allreduce in the Horovod style, and the word-count map-reduce pattern follow this row.) | Pacheco Chapter 3
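A minimal sketch of point-to-point and group communication from (i); the token values are illustrative. Compile with mpicc and run with, e.g., mpirun -np 4:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* point-to-point: rank 0 sends to everyone */
        for (int dest = 1; dest < size; dest++) {
            int token = 100 + dest;
            MPI_Send(&token, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        int token;
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d\n", rank, token);
    }

    int local = rank, global;              /* group communication: sum over all ranks */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %d\n", global);

    MPI_Finalize();
    return 0;
}
```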
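A sketch of the allreduce pattern behind Horovod and Baidu's ring allreduce from (iii): every rank contributes a local gradient vector and gets back the global sum, which it then averages. The "gradient" values are made up:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float grad[4] = { 0.1f * rank, 1.0f, 2.0f, 3.0f };   /* illustrative local "gradients" */
    float avg[4];
    MPI_Allreduce(grad, avg, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < 4; i++) avg[i] /= size;          /* sum -> average */

    if (rank == 0) printf("averaged grad[0] = %f\n", avg[0]);
    MPI_Finalize();
    return 0;
}
```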
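A sequential sketch of the map-reduce word-count pattern from (iv): map emits (word, 1) pairs, an in-memory sort stands in for the shuffle that groups equal keys, and reduce sums each group. The inputs and sizes are illustrative:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXPAIRS 1024
typedef struct { char word[32]; int count; } pair_t;
static pair_t pairs[MAXPAIRS];
static int npairs = 0;

/* map: emit (word, 1) for every word in the input split */
static void map(const char *split) {
    char buf[256];
    strncpy(buf, split, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " "); w && npairs < MAXPAIRS; w = strtok(NULL, " ")) {
        strncpy(pairs[npairs].word, w, 31);
        pairs[npairs].word[31] = '\0';
        pairs[npairs].count = 1;
        npairs++;
    }
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const pair_t *)a)->word, ((const pair_t *)b)->word);
}

int main(void) {
    const char *splits[] = { "the quick brown fox", "the lazy dog the end" };
    for (int i = 0; i < 2; i++) map(splits[i]);       /* map phase */
    qsort(pairs, npairs, sizeof(pair_t), cmp);        /* shuffle: group by key */
    for (int i = 0; i < npairs; ) {                   /* reduce phase: sum each group */
        int j = i, sum = 0;
        while (j < npairs && !strcmp(pairs[j].word, pairs[i].word)) sum += pairs[j++].count;
        printf("%s %d\n", pairs[i].word, sum);
        i = j;
    }
    return 0;
}
```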
[How to model parallel programs] | i. Parallel Random Access Machine (PRAM) model; design and time-complexity analysis of PRAM algorithms: prammodel_oct13.pdf. ii. Foster's design methodology for parallel programs: partitioning, communication, agglomeration, mapping: foster_oct18.pdf. iii. Bulk Synchronous Parallel (BSP) model, comprising a sequence of global supersteps, each with local computation, communication, and synchronization: BSP Wikipedia article; Google Pregel and Apache Hama follow the BSP model of parallel programming. iv. Work-depth model; Brent's theorem on the upper bound of running time on p processors: brent_nov3.pdf. (A worked Brent's theorem example follows this row.) | Quinn Chapter 3, GGKK Chapter 3, Kumar Chapter 3.1-3.2 (kumar chapter 3.pdf)
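A worked instance of Brent's theorem from (iv); the problem size and processor count are illustrative:

```latex
Brent's theorem: a computation with work $W$ (total operations) and depth $D$
(critical-path length) runs on $p$ processors in time
\[
  T_p \le \frac{W}{p} + D.
\]
Example: summing $n = 1024$ numbers with a balanced binary tree has work
$W = n - 1 = 1023$ and depth $D = \log_2 n = 10$, so on $p = 32$ processors
\[
  T_{32} \le \frac{1023}{32} + 10 \approx 42 \text{ steps},
\]
versus $1023$ steps on one processor.
```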
[Parallel algorithmic techniques] | i. Pointer jumping, or recursive doubling: pointerjumping-oct20.pdf. ii. Euler tour technique: eulertour_oct27.pdf. iii. Tree contraction: treecontraction_oct31.pdf. iv. Divide and conquer (counting leading ones, polynomial evaluation): divide_conquer_nov3.pdf. v. Divide and conquer (mergesort, quicksort): sequentialsorting_nov7.pdf, parallelsorting_nov7.pdf. (Sketches of pointer jumping and task-parallel mergesort follow this row.) | Kumar Chapter 7 (kumar chapter 7.pdf)
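A minimal sketch of pointer jumping from (i), ranking an 8-node list in O(log n) rounds: after each round every node's pointer skips twice as far. Double buffering emulates the PRAM model's synchronous read-then-write steps; the list size is illustrative:

```c
#include <stdio.h>
#include <string.h>
#include <omp.h>

#define N 8

int main(void) {
    int next[N], rank[N], nnext[N], nrank[N];
    for (int i = 0; i < N; i++) {          /* list 0 -> 1 -> ... -> 7 */
        next[i] = (i < N - 1) ? i + 1 : i; /* the tail points to itself */
        rank[i] = (i < N - 1) ? 1 : 0;     /* distance to tail, initially one hop */
    }
    for (int step = 1; step < N; step *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {      /* jump: accumulate rank, double the pointer */
            nrank[i] = rank[i] + rank[next[i]];
            nnext[i] = next[next[i]];
        }
        memcpy(rank, nrank, sizeof rank);  /* "synchronous" write-back of the round */
        memcpy(next, nnext, sizeof next);
    }
    for (int i = 0; i < N; i++) printf("node %d: distance to tail = %d\n", i, rank[i]);
    return 0;
}
```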
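A sketch of divide-and-conquer mergesort from (v) using OpenMP tasks: the two halves are sorted by independent tasks, and the if clause keeps subarrays below an (illustrative) 1000-element cutoff in the current task so task overhead does not swamp the work:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    #pragma omp task shared(a, tmp) if (hi - lo > 1000)
    msort(a, tmp, lo, mid);                /* left half in a child task */
    msort(a, tmp, mid, hi);                /* right half in the current task */
    #pragma omp taskwait                   /* both halves sorted before merging */
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    int n = 1 << 20;
    int *a = malloc(n * sizeof(int)), *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = rand();
    #pragma omp parallel
    #pragma omp single                     /* one thread seeds the task tree */
    msort(a, tmp, 0, n);
    for (int i = 1; i < n; i++)
        if (a[i - 1] > a[i]) { printf("not sorted!\n"); return 1; }
    printf("sorted %d ints\n", n);
    free(a); free(tmp);
    return 0;
}
```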
[Concurrency in other programming languages] | Java and Python: the Java memory model and the Python global interpreter lock (GIL): javamemorymodel_pythongil_nov10.pdf |