Evaluation plan: Assignments 50%, Minor1 20%, Major 30%
Piazza link: col730
Topic | Slides | Supplementary Material
Introduction | course_content_aug4.pdf | Pacheco, GGKK Chapter 1
[Performance] The only reason for parallelism: quantifying performance improvements analytically | amdahl_gustafson_aug4.pdf (a worked Amdahl/Gustafson example follows this row) | amdahl_gustafson_alternateslides.pdf, amdahl_gustafson_problems.pdf, Pacheco Chapter 2.6, GGKK Chapter 5
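A minimal worked example of the two laws; the 90% parallel fraction and the core counts are illustrative assumptions, not numbers from the slides:

```c
#include <stdio.h>

/* Amdahl: fixed problem size. f = parallelizable fraction of the work. */
double amdahl(double f, int p)    { return 1.0 / ((1.0 - f) + f / p); }

/* Gustafson: problem size scales with p. s = serial fraction of the scaled run. */
double gustafson(double s, int p) { return p - s * (p - 1); }

int main(void) {
    double f = 0.9;             /* assume 90% of the work parallelizes */
    for (int p = 1; p <= 1024; p *= 4)
        printf("p=%4d  Amdahl S=%6.2f  Gustafson S=%7.2f\n",
               p, amdahl(f, p), gustafson(1.0 - f, p));
    return 0;   /* Amdahl saturates near 1/(1-f) = 10; Gustafson keeps growing */
}
```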
[NVIDIA GPUs and CUDA] GPU architecture, CUDA programming and optimizations | i. Latency vs. throughput; hardware organization of a GPU (SMs, memory hierarchy); programming model (threads, blocks, grids; allocating memory and copying between host DRAM and GPU memory; kernel launch): cudabasics_aug8.pdf; small sample programs: helloworld.cu, cudamem.cu. ii. CUDA optimizations: CUDAoptimization1_aug18.pdf, CUDAoptimization2_aug22.pdf, parallelHistogram_aug22.pdf. iii. CUDA streams: cudastreams_aug25_aug29.pdf, parallelconvolution_aug29.pdf; sample stream program: simplestream.cu, common.h. (Sketches of a basic kernel launch and of stream overlap follow this row.) | Kirk and Hwu; nvidia1, nvidia2, nvidia3
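A minimal sketch of the allocate/copy/launch/copy-back cycle from (i); this is not the course's helloworld.cu or cudamem.cu, just an illustrative vector addition:

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Each thread adds one element; the grid of blocks covers the whole array. */
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { ha[i] = i; hb[i] = 2.0f * i; }

    float *da, *db, *dc;                                 /* allocate GPU DRAM */
    cudaMalloc((void **)&da, bytes);
    cudaMalloc((void **)&db, bytes);
    cudaMalloc((void **)&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);   /* host -> device */
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    int threads = 256, blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(da, db, dc, n);          /* kernel launch */

    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);   /* device -> host */
    printf("c[12345] = %f\n", hc[12345]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```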
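A sketch of the copy/compute overlap idea behind (iii), assuming pinned host memory and two streams; the chunking scheme and the scale kernel are illustrative and not taken from simplestream.cu:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 22, nchunks = 4, chunk = n / nchunks;
    float *h, *d;
    cudaMallocHost((void **)&h, n * sizeof(float));  /* pinned memory: needed for async copies */
    cudaMalloc((void **)&d, n * sizeof(float));
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]); cudaStreamCreate(&s[1]);

    for (int c = 0; c < nchunks; c++) {       /* a chunk's copy can overlap the other stream's kernel */
        cudaStream_t st = s[c % 2];
        float *hp = h + c * chunk, *dp = d + c * chunk;
        cudaMemcpyAsync(dp, hp, chunk * sizeof(float), cudaMemcpyHostToDevice, st);
        scale<<<(chunk + 255) / 256, 256, 0, st>>>(dp, chunk);
        cudaMemcpyAsync(hp, dp, chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
    }
    cudaDeviceSynchronize();                  /* wait for both streams to drain */
    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```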
[Multicore CPUs] Pthreads and OpenMP | i. Pthread and OpenMP programs for basic hello-world messages; for computing pi, where the final sum increment forms a critical section (protected with busy waiting and a mutex); and for message passing from a source to a destination thread, where the destination thread waits on a semaphore. OpenMP compilation: "gcc -g -Wall -fopenmp -o omp_hello omp_hello.c"; pthread compilation: "gcc -g -Wall -o pth_msgpassing pth_msgpassing.c -lpthread". Sample programs: pth_hello.c, omp_hello.c, pth_pi.c, pth_msgpassing.c. Slides: pthread_openmp_sep8.pdf. ii. Caches for faster memory access; cache coherence, necessary for correctness of multi-threaded programs on multi-core CPUs; false sharing of a cache line among threads, which hurts performance and can be handled with padding or thread-private data structures: caches_sep12.pdf, mesi_sep12.pdf, falsesharing_sep14.pdf. iii. How locks and barriers are implemented by operating systems using hardware instruction primitives: hwsynchronizationprimitives_sep19.pdf. iv. How processor cores are connected to each other and to memory: interconnection networks: interconnection_topology_sep14.pdf. v. Examples of concurrency bugs from real-world software: concurrency-bugs-asplospaper.pdf, concurrencybugs_sep22.pdf. (Sketches of the mutex-protected pi sum, false-sharing padding, and a test-and-set spinlock follow this row.) | Pacheco Chapters 4 and 5
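A minimal sketch of the mutex-protected pi computation described in (i); this is not the course's pth_pi.c, and the Leibniz series and thread count are illustrative choices. It compiles with the pthread command shown above:

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
static long long n = 100000000;        /* terms of the Leibniz series for pi/4 */
static double sum = 0.0;               /* shared; updating it is the critical section */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *thread_pi(void *arg) {
    long rank = (long)arg;
    double local = 0.0;
    for (long long i = rank; i < n; i += NTHREADS)
        local += ((i % 2 == 0) ? 1.0 : -1.0) / (2.0 * i + 1.0);
    pthread_mutex_lock(&lock);         /* protect the shared increment */
    sum += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long r = 0; r < NTHREADS; r++) pthread_create(&t[r], NULL, thread_pi, (void *)r);
    for (int r = 0; r < NTHREADS; r++)  pthread_join(t[r], NULL);
    printf("pi ~= %.10f\n", 4.0 * sum);
    return 0;
}
```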
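A sketch of the false-sharing fix from (ii): padding each thread's counter out to a cache line so the threads stop invalidating each other's copies under MESI. The 64-byte line size is an assumption, typical of x86; removing the pad makes the same program markedly slower:

```c
#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
#define ITERS 100000000L

/* Without pad, all four counters share one cache line and every increment
   invalidates the other cores' copies; pad gives each counter its own line. */
struct counter { long v; char pad[64 - sizeof(long)]; };
static struct counter counters[NTHREADS];

static void *work(void *arg) {
    long r = (long)arg;
    for (long i = 0; i < ITERS; i++) counters[r].v++;
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (long r = 0; r < NTHREADS; r++) pthread_create(&t[r], NULL, work, (void *)r);
    for (int r = 0; r < NTHREADS; r++)  pthread_join(t[r], NULL);
    long total = 0;
    for (int r = 0; r < NTHREADS; r++)  total += counters[r].v;
    printf("total = %ld\n", total);
    return 0;
}
```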
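A sketch of (iii): a busy-waiting lock built on an atomic test-and-set. The GCC/Clang __sync builtins stand in here for the underlying hardware instruction:

```c
#include <stdio.h>
#include <pthread.h>

static volatile int flag = 0;   /* 0 = free, 1 = held */
static long counter = 0;

/* Spin until the atomic exchange returns 0, i.e. we were the ones who set it. */
static void spin_lock(void)   { while (__sync_lock_test_and_set(&flag, 1)) ; }
static void spin_unlock(void) { __sync_lock_release(&flag); }

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000000; i++) {
        spin_lock();
        counter++;              /* critical section */
        spin_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int r = 0; r < 4; r++) pthread_create(&t[r], NULL, work, NULL);
    for (int r = 0; r < 4; r++) pthread_join(t[r], NULL);
    printf("counter = %ld (expect 4000000)\n", counter);
    return 0;
}
```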
[Parallel programming on distributed memory] MPI, MapReduce | i. Point-to-point and group communication with MPI: mpi_sep29.pdf. ii. The MPI standard, which libraries such as OpenMPI, MPICH, and Intel MPI implement: mpi40-standard_oct6.pdf. iii. How companies like Uber and Baidu use MPI for large-scale computations: uber_horovod_oct6.pdf, horovod-mpi-oct6.pdf, baidu_allreduce_oct6.pdf. iv. The map-reduce programming model: mapreduce_rutgers_oct10.pdf, mapreducepaper_oct10.pdf. (Sketches of MPI point-to-point/collective calls, an allreduce in the Horovod style, and the word-count map-reduce pattern follow this row.) | Pacheco Chapter 3
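A minimal sketch of point-to-point and group communication from (i); the token values are illustrative. Compile with mpicc and run with, e.g., mpirun -np 4:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* point-to-point: rank 0 sends to everyone */
        for (int dest = 1; dest < size; dest++) {
            int token = 100 + dest;
            MPI_Send(&token, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
        }
    } else {
        int token;
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank %d received %d\n", rank, token);
    }

    int local = rank, global;              /* group communication: sum over all ranks */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("sum of ranks = %d\n", global);

    MPI_Finalize();
    return 0;
}
```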
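A sketch of the allreduce pattern behind Horovod and Baidu's ring allreduce from (iii): every rank contributes a local gradient vector and gets back the global sum, which it then averages. The "gradient" values are made up:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float grad[4] = { 0.1f * rank, 1.0f, 2.0f, 3.0f };   /* illustrative local "gradients" */
    float avg[4];
    MPI_Allreduce(grad, avg, 4, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < 4; i++) avg[i] /= size;          /* sum -> average */

    if (rank == 0) printf("averaged grad[0] = %f\n", avg[0]);
    MPI_Finalize();
    return 0;
}
```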
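A sequential sketch of the map-reduce word-count pattern from (iv): map emits (word, 1) pairs, an in-memory sort stands in for the shuffle that groups equal keys, and reduce sums each group. The inputs and sizes are illustrative:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAXPAIRS 1024
typedef struct { char word[32]; int count; } pair_t;
static pair_t pairs[MAXPAIRS];
static int npairs = 0;

/* map: emit (word, 1) for every word in the input split */
static void map(const char *split) {
    char buf[256];
    strncpy(buf, split, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    for (char *w = strtok(buf, " "); w && npairs < MAXPAIRS; w = strtok(NULL, " ")) {
        strncpy(pairs[npairs].word, w, 31);
        pairs[npairs].word[31] = '\0';
        pairs[npairs].count = 1;
        npairs++;
    }
}

static int cmp(const void *a, const void *b) {
    return strcmp(((const pair_t *)a)->word, ((const pair_t *)b)->word);
}

int main(void) {
    const char *splits[] = { "the quick brown fox", "the lazy dog the end" };
    for (int i = 0; i < 2; i++) map(splits[i]);       /* map phase */
    qsort(pairs, npairs, sizeof(pair_t), cmp);        /* shuffle: group by key */
    for (int i = 0; i < npairs; ) {                   /* reduce phase: sum each group */
        int j = i, sum = 0;
        while (j < npairs && !strcmp(pairs[j].word, pairs[i].word)) sum += pairs[j++].count;
        printf("%s %d\n", pairs[i].word, sum);
        i = j;
    }
    return 0;
}
```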
[How to model parallel programs] | i. Parallel Random Access Machine (PRAM) model; design and time-complexity analysis of PRAM algorithms: prammodel_oct13.pdf. ii. Foster's design methodology for parallel programs: partitioning, communication, agglomeration, mapping: foster_oct18.pdf. iii. Bulk Synchronous Parallel (BSP) model, comprising a sequence of global supersteps, each with local computation, communication, and synchronization: BSP Wikipedia article; Google Pregel and Apache Hama follow the BSP model of parallel programming. iv. Work-depth model; Brent's theorem on the upper bound of running time on p processors: brent_nov3.pdf. (A worked Brent's theorem example follows this row.) | Quinn Chapter 3, GGKK Chapter 3, Kumar Chapter 3.1-3.2 (kumar chapter 3.pdf)
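A worked instance of Brent's theorem from (iv); the problem size and processor count are illustrative:

```latex
Brent's theorem: a computation with work $W$ (total operations) and depth $D$
(critical-path length) runs on $p$ processors in time
\[
  T_p \le \frac{W}{p} + D.
\]
Example: summing $n = 1024$ numbers with a balanced binary tree has work
$W = n - 1 = 1023$ and depth $D = \log_2 n = 10$, so on $p = 32$ processors
\[
  T_{32} \le \frac{1023}{32} + 10 \approx 42 \text{ steps},
\]
versus $1023$ steps on one processor.
```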
[Parallel algorithmic techniques] | i. Pointer jumping, or recursive doubling: pointerjumping-oct20.pdf. ii. Euler tour technique: eulertour_oct27.pdf. iii. Tree contraction: treecontraction_oct31.pdf. iv. Divide and conquer (counting leading ones, polynomial evaluation): divide_conquer_nov3.pdf. v. Divide and conquer (mergesort, quicksort): sequentialsorting_nov7.pdf, parallelsorting_nov7.pdf. (Sketches of pointer jumping and task-parallel mergesort follow this row.) | Kumar Chapter 7 (kumar chapter 7.pdf)
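A minimal sketch of pointer jumping from (i), ranking an 8-node list in O(log n) rounds: after each round every node's pointer skips twice as far. Double buffering emulates the PRAM model's synchronous read-then-write steps; the list size is illustrative:

```c
#include <stdio.h>
#include <string.h>
#include <omp.h>

#define N 8

int main(void) {
    int next[N], rank[N], nnext[N], nrank[N];
    for (int i = 0; i < N; i++) {          /* list 0 -> 1 -> ... -> 7 */
        next[i] = (i < N - 1) ? i + 1 : i; /* the tail points to itself */
        rank[i] = (i < N - 1) ? 1 : 0;     /* distance to tail, initially one hop */
    }
    for (int step = 1; step < N; step *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {      /* jump: accumulate rank, double the pointer */
            nrank[i] = rank[i] + rank[next[i]];
            nnext[i] = next[next[i]];
        }
        memcpy(rank, nrank, sizeof rank);  /* "synchronous" write-back of the round */
        memcpy(next, nnext, sizeof next);
    }
    for (int i = 0; i < N; i++) printf("node %d: distance to tail = %d\n", i, rank[i]);
    return 0;
}
```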
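A sketch of divide-and-conquer mergesort from (v) using OpenMP tasks: the two halves are sorted by independent tasks, and the if clause keeps subarrays below an (illustrative) 1000-element cutoff in the current task so task overhead does not swamp the work:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

static void merge(int *a, int *tmp, int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

static void msort(int *a, int *tmp, int lo, int hi) {
    if (hi - lo < 2) return;
    int mid = lo + (hi - lo) / 2;
    #pragma omp task shared(a, tmp) if (hi - lo > 1000)
    msort(a, tmp, lo, mid);                /* left half in a child task */
    msort(a, tmp, mid, hi);                /* right half in the current task */
    #pragma omp taskwait                   /* both halves sorted before merging */
    merge(a, tmp, lo, mid, hi);
}

int main(void) {
    int n = 1 << 20;
    int *a = malloc(n * sizeof(int)), *tmp = malloc(n * sizeof(int));
    for (int i = 0; i < n; i++) a[i] = rand();
    #pragma omp parallel
    #pragma omp single                     /* one thread seeds the task tree */
    msort(a, tmp, 0, n);
    for (int i = 1; i < n; i++)
        if (a[i - 1] > a[i]) { printf("not sorted!\n"); return 1; }
    printf("sorted %d ints\n", n);
    free(a); free(tmp);
    return 0;
}
```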
[Concurrency in other programming languages] | Java and Python: the Java memory model and the Python global interpreter lock (GIL): javamemorymodel_pythongil_nov10.pdf |