Ph.D. Research

Prof. Subodh Kumar    Prof. Sorav Bansal

Unicorn: A Bulk Synchronous Programming Model, Framework and Runtime for Hybrid CPU-GPU Clusters

The difficulty of programming hybrid CPU-GPU clusters often prevents software from exploiting their full computational power. This work addresses this difficulty and presents Unicorn, a novel parallel programming model for hybrid CPU-GPU clusters, along with the design and implementation of its runtime. In particular, this work demonstrates that efficient distributed shared memory style programming is possible. We also show that the simplicity of shared memory style programming can be retained across CPUs and GPUs in a cluster, minus the frustration of dealing with race conditions, and that this can be done with a unified abstraction that avoids much of the complication of dealing with hybrid architectures. This is achieved with the help of transactional semantics, deferred bulk data synchronization, workload pipelining and various communication and computation scheduling optimizations.
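
The idea of deferred bulk synchronization can be illustrated with a minimal, single-process sketch in C. This is not Unicorn's API: the names `subtask` and `superstep` are hypothetical, the subtasks run one after another rather than on separate CPUs or GPUs, and the smoothing computation is chosen only for illustration. Each subtask reads the shared array as it stood before the step and writes only to its own staging region; the staged results are committed to the shared array in one bulk copy after every subtask has finished.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical subtask: smooths its slice [begin, end) of a shared array.
   It reads only the pre-step snapshot and writes only its private output,
   so no subtask can observe another subtask's partial results. */
static void subtask(const double *snapshot, size_t n,
                    size_t begin, size_t end, double *private_out)
{
    for (size_t i = begin; i < end; ++i) {
        /* Clamp neighbour indices at the array boundaries. */
        double left  = snapshot[i == 0 ? 0 : i - 1];
        double right = snapshot[i + 1 == n ? i : i + 1];
        private_out[i - begin] = 0.5 * (left + right);
    }
}

/* One step: run every subtask against the same snapshot, then commit all
   staged writes in a single bulk copy. Returns 0 on success, -1 if the
   staging buffer cannot be allocated. */
int superstep(double *shared, size_t n, size_t num_subtasks)
{
    assert(n > 0 && num_subtasks > 0);

    double *staged = malloc(n * sizeof *staged);
    if (staged == NULL)
        return -1;

    size_t chunk = (n + num_subtasks - 1) / num_subtasks; /* ceiling division */
    for (size_t t = 0; t < num_subtasks; ++t) {
        size_t begin = t * chunk;
        if (begin >= n)
            break;
        size_t end = (begin + chunk < n) ? begin + chunk : n;
        subtask(shared, n, begin, end, staged + begin);
    }

    /* Deferred synchronization: shared memory changes only here. */
    memcpy(shared, staged, n * sizeof *shared);
    free(staged);
    return 0;
}
```

Because writes become visible only at the commit, the result does not depend on how the work is split or on the order in which subtasks run, which is the property that removes race conditions in this style of programming.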

We find that parallelizing applications such as matrix multiplication or 2D FFT with our system requires only about 30 lines of C code to set up the runtime. The rest of the application is a regular single CPU/GPU implementation. This indicates the ease of extending sequential code to a parallel environment. The execution is efficient as well. When multiplying two square matrices of size 65536 × 65536, Unicorn achieves a peak performance of 7.88 TFlop/s on a 14-node cluster in which each node is equipped with two Tesla M2070 GPUs and two 6-core Intel Xeon X5650 2.67 GHz CPUs, with the nodes connected over a 32 Gbps InfiniBand network.
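
To put the reported figure in perspective, the following C sketch estimates the amount of arithmetic in the matrix multiplication. It assumes the conventional count of 2n³ floating-point operations for a dense n × n product; the text does not state how operations were counted, and the function names are hypothetical. Since 7.88 TFlop/s is a peak rate, the derived time is a lower bound rather than a measured run time.

```c
#include <assert.h>
#include <stdint.h>

/* Floating-point operations in a dense n x n matrix multiply, using the
   conventional 2*n^3 count (n^3 multiplications plus n^3 additions).
   This counting convention is an assumption, not taken from the text. */
double matmul_flops(uint64_t n)
{
    assert(n > 0);
    double dn = (double)n;
    return 2.0 * dn * dn * dn;
}

/* Seconds needed to execute `flops` operations at a sustained rate
   given in TFlop/s (1 TFlop/s = 1e12 operations per second). */
double seconds_at_rate(double flops, double tflops_per_sec)
{
    assert(tflops_per_sec > 0.0);
    return flops / (tflops_per_sec * 1e12);
}
```

For n = 65536, `matmul_flops` gives 2^49, about 5.63 × 10^14 operations, and `seconds_at_rate(matmul_flops(65536), 7.88)` comes to roughly 71 seconds if the peak rate were sustained throughout.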