PMLIB - Partitioned Memory Parallel Programming Library

Based on BSP-RAMP: Partitioned Memory Parallel Programming Framework

Tarun Beri    Sorav Bansal    Subodh Kumar



We are designing a new framework for parallel programming called BSP-RAMP (named after the Bulk Synchronous PRAM model, with the last `P' standing for partitioned memory). It consists of a simplified, distributed-shared-memory-based programming model that lets the application developer focus mainly on task decomposition. The model is unified across many-core processors (e.g., CPUs and GPUs), multiple processors within a system, and multiple systems. We are also developing a library implementation as a proof of concept of the model, which we call PMLIB (Partitioned Memory parallel programming Library). The library efficiently maps tasks to multiple compute engines, performs the required communication, and schedules tasks to completion. In addition to convenience, the framework provides a race-free and deadlock-free programming environment by letting tasks own a partition of the memory, which simplifies programming significantly. We have run a number of experiments with PMLIB: array summation; three scientific applications (Matrix Multiplication, N-body Simulation and Fast Fourier Transform); a commercial application that estimates stock prices using Monte Carlo simulation; an implementation of PageRank, a popular example of MapReduce-style programming; and a zlib-based compression program.

An important design decision for PMLIB is to provide a minimalistic interface for parallelizing programs, with optional controls to improve efficiency when necessary. The application need not deal with thread creation, management or destruction. Moreover, much of the GPU's complexity (such as CUDA kernel invocation and data transfer between CPU and GPU) is also handled automatically by PMLIB. The division of work into coarse-grained tasks with explicit synchronization points frees the programmer from the overheads and pitfalls of fine-grained synchronization, such as race conditions and deadlocks. At present, we are working on techniques that provide fault tolerance and automatic dynamic load balancing among multiple heterogeneous processing elements in the cluster. The following sections provide more detail on the model and its implementation.



1. Overview

The BSP-RAMP programming model is designed to incrementally parallelize sections of an existing or new sequential program. The input and output memory of the sequential code section being parallelized is partitioned among different processing elements, each of which essentially runs code similar to the original sequential program (but on a reduced data set). The advantage of this approach is that the code can always be debugged in a sequential setup before being parallelized. After the input and output data are distributed, the corresponding processing elements use them to perform the required computations. When all processing elements finish their computations, they enter a barrier synchronization phase where they optionally synchronize their computational results with other processing elements. The model also allows a reduction-style synchronization in which data from different processing elements is associatively reduced to a final value (e.g., data from different processing elements could be added together to produce the final result).

Each parallel computation in this model consists of a series of three-stage tasks, the stages being data distribution, computation and synchronization. The instance of a task that runs on each processing element in parallel is known as a subtask. The model supports running subtasks on heterogeneous computing devices of all kinds (shared memory based, distributed memory based, and discrete graphics cards). PMLIB, the library implementation of this model, uses OpenMP, MPI and CUDA underneath and allows a program to be parallelized over the CPU and GPU cores of all machines in a computer network. The heterogeneity of the underlying devices, scheduling, scalability, network communication, and thread and process management are completely hidden from the application programmer. The application only needs to provide CPU and GPU code for its specific computation; everything else is transparently managed by the library.
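
Schematically, a task amounts to distributing the data, running near-serial computation on each partition, and associatively reducing the per-subtask results. The sketch below illustrates this three-stage structure only; the names are illustrative and are not the PMLIB API (the library itself performs the partitioning, launching and reduction on the application's behalf):

    // Schematic of one BSP-RAMP task: distribute, compute, synchronize/reduce.
    // Illustrative C++ only -- the names below are not part of the PMLIB API.
    #include <vector>
    #include <numeric>
    #include <algorithm>
    #include <cstddef>

    struct Partition { std::size_t offset, length; };   // the slice of memory a subtask owns

    // Stage 1: data distribution -- split the input among numSubtasks subtasks.
    std::vector<Partition> distribute(std::size_t total, std::size_t numSubtasks)
    {
        std::vector<Partition> parts;
        std::size_t chunk = (total + numSubtasks - 1) / numSubtasks;
        for (std::size_t s = 0; s < numSubtasks; ++s) {
            std::size_t off = std::min(s * chunk, total);
            parts.push_back({off, std::min(chunk, total - off)});
        }
        return parts;
    }

    // Stage 2: computation -- each subtask runs essentially the serial code on its slice.
    long long subtaskCompute(const int* input, Partition p)
    {
        return std::accumulate(input + p.offset, input + p.offset + p.length, 0LL);
    }

    // Stage 3: synchronization -- an associative reduction of the per-subtask results.
    long long reduceResults(const std::vector<long long>& partials)
    {
        return std::accumulate(partials.begin(), partials.end(), 0LL);
    }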



2. Comparison to BSPRAM

The following table highlights the major differences between the BSP-RAMP and BSPRAM models. More details about BSPRAM are available in the paper "The Bulk-Synchronous Parallel Random Access Machine" by Alexandre Tiskin.

Programming Model Aspect: Data Distribution
    BSPRAM:   No input/output data distribution; all processes see the entire memory.
    BSP-RAMP: Each subtask owns and sees a partition of the input/output memory. Optionally, subtasks may own exclusive overlapping partitions (which could be the entire memory).

Programming Model Aspect: Data Visibility
    BSPRAM:   Programs see two levels of memory organization - local exclusive and global shared.
    BSP-RAMP: Programs see only one level of partitioned, exclusive memory.


Certain other differences appear in a typical implementation of each model and in the resulting parallel applications. The following table lists those differences and the advantages of the BSP-RAMP model over the BSPRAM model.

Implementation Aspect: Computation Code
    BSPRAM:   Serial code might need restructuring to fit into the parallel context.
    BSP-RAMP: Subtask code is generally similar to the serial code; it just operates on a smaller data set. Most advantages of serial code, such as ease of debugging, are preserved.

Implementation Aspect: Reduction
    BSPRAM:   Implementations perform conflict resolution to arrive at a final value from a set.
    BSP-RAMP: Optimized hierarchical reduction of output partitions is done first within individual hosts in parallel and then across hosts.

Implementation Aspect: Heterogeneity
    BSPRAM:   A shared main-memory view prevents efficient incorporation of devices like GPU cards into the model.
    BSP-RAMP: The partitioned memory view allows partitions to be optimized for the memory hierarchy of the target device (such as global and shared GPU memories) and of the entire cluster.




3. Experimental Results

The table below lists some of the parallel programs created with PMLIB and their experimental results. Most of the experiments were performed on a cluster of four 64-bit, eight-core machines (Intel Xeon E5450, 3.00 GHz) running Ubuntu Linux 8.04.2, connected by a 1 Gbps Ethernet network. Two of the hosts also have one Tesla C1060 card each. The last two columns of the table list the speedups achieved by the parallel PMLIB program over the corresponding serial implementation on the CPU and GPU clusters. Details of each experiment follow the table.

Category       Experiment                     Serial Time (s)   Parallel Time on PMLIB (s)     PMLIB Speedup
                                                                CPU cluster   GPU cluster      CPU cluster   GPU cluster

Scientific     Square Matrix Multiplication   919.2             42.7          12.3             21.52x        74.73x
               N-body Simulation              2197864.0         7392.68       219.23           297.3x        10025.3x
               Fast Fourier Transform         40.7              8.79          24.77            4.63x         1.64x
Commercial     Monte Carlo Stock Pricing      5891.24           187.67        2.34             31.39x        2517.6x
Data Analytic  Page Rank                      1292.98           380.6         408.1            3.4x          3.17x
Utility        Data Compression               141.78            18.57         -                7.63x         -
Others         Array Summation                135.76            8.58          2.52             15.8x         53.87x




Square Matrix Multiplication

This application multiplies two large square matrices. To parallelize the routine, we divide the rows of the first input matrix equally among subtasks. Each subtask sees the full second input matrix and computes its part of the result. We used a simple CUDA kernel for the experiment: it uses only the device's (slow) global memory and ignores its (faster) shared memory. A more efficient CUDA kernel that better exploits the GPU's memory hierarchy is likely to scale even better. For the largest matrix size of 5000x5000, we obtain a speedup of about 75x. In general, the speedup increases with matrix size, as more computation is done per unit of communication for larger matrices due to the O(n^3) nature of the algorithm.
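
For illustration, a global-memory-only kernel of the kind used here might look as follows (a minimal sketch, not the exact kernel from the experiment); A is the slice of rows of the first matrix assigned to one subtask, B is the full second matrix, and C is that subtask's slice of the result:

    // Each thread computes one element of this subtask's slice of the result matrix.
    // A: rowsAssigned x K slice of the first matrix, B: full K x N second matrix,
    // C: rowsAssigned x N output slice. Only global memory is used.
    __global__ void matmulSlice(const float* A, const float* B, float* C,
                                int rowsAssigned, int K, int N)
    {
        int row = blockIdx.y * blockDim.y + threadIdx.y;   // row within the assigned slice
        int col = blockIdx.x * blockDim.x + threadIdx.x;   // column of the result
        if (row >= rowsAssigned || col >= N)
            return;

        float sum = 0.0f;
        for (int k = 0; k < K; ++k)                        // dot product straight from global memory
            sum += A[row * K + k] * B[k * N + col];
        C[row * N + col] = sum;
    }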




N-body Simulation

This experiment simulates a system of particles (physical bodies with mass) that exert mutual gravitational pull on each other. The motion of each particle is governed by the resultant force from all other bodies in the system. We use a brute-force O(n^2) approach to compute the gravitational forces between every pair of particles. Using PMLIB, we partition the work equally among all available processing elements. We used the kernel from CUDA SDK version 2.3 for this experiment. On the GPU cluster, the parallel version of the benchmark is 326x faster than the serial version for a simulation of 50,000 bodies over 100 iterations. As with the matrix multiplication example, the advantage of using parallel processors increases with the amount of computation. For a system of 100,000 bodies, the speedup (for 100 iterations) over the serial implementation is 10025.3x! This astonishing result is due to the large number of GPU threads running in parallel.
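
The experiment itself uses the kernel shipped with the CUDA SDK; the brute-force interaction it computes boils down to something like the following simplified, unoptimized sketch (with the gravitational constant folded into the masses and a softening term to avoid division by zero). In the PMLIB setting, each subtask would run this over its partition of bodies while reading the full position array:

    // Brute-force O(n^2) gravity: each thread owns one body and accumulates the
    // acceleration contributed by every other body. pos[j] = (x, y, z, mass).
    __global__ void computeAccelerations(const float4* pos, float3* acc,
                                         int n, float softening2)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;

        float4 pi = pos[i];
        float3 a = make_float3(0.0f, 0.0f, 0.0f);
        for (int j = 0; j < n; ++j) {
            float4 pj = pos[j];
            float dx = pj.x - pi.x, dy = pj.y - pi.y, dz = pj.z - pi.z;
            float dist2 = dx * dx + dy * dy + dz * dz + softening2;
            float invDist = rsqrtf(dist2);
            float s = pj.w * invDist * invDist * invDist;   // m_j / r^3 (G folded into mass)
            a.x += dx * s;  a.y += dy * s;  a.z += dz * s;
        }
        acc[i] = a;
    }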




Fast Fourier Transform

This experiment parallelizes a well-known discrete Fourier transform algorithm, the Fast Fourier Transform (FFT). We parallelize the computation of the 2-D FFT of a matrix of randomly generated time-domain complex numbers. To parallelize, we first compute 1-D FFTs over all rows of the matrix (in parallel) and then compute 1-D FFTs over all columns of the resulting matrix (also in parallel). The 1-D FFT computations on rows and columns are separated by a sync point (one task finishes and another starts). We use the publicly available CUFFT library for this experiment. Due to the large amount of data transfer involved, the communication cost far exceeds the computation time. As a result, computation is fastest when done on one host rather than on the cluster. For an input matrix of dimension 2^12 x 2^15, the time taken for the 2-D FFT on one host is 8.79 seconds, while on 4 hosts it is 35.67 seconds. The amount of data transferred in this process is more than 3.0 GB, which on our 1 Gbps network takes about 31.5 seconds. So the overall computation time spent by each host in the parallel setup is less than 3 seconds (this time includes the library's overheads: memory allocation, buffering, thread management and scheduling). We expect this experiment to scale better on multiple hosts if a faster network (e.g., 100 Gbps Ethernet) is available. On our experimental setup, we achieve a speedup of 4.63x by parallelizing on the 8 CPU cores of a single host.
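
The row-column decomposition described above can be sketched with batched 1-D CUFFT plans. This is an illustration of the structure rather than the experiment's actual code: transposeOnDevice is an assumed helper (e.g., a simple transpose kernel launcher) and error handling is omitted:

    #include <cufft.h>

    // Assumed helper: transposes a rows x cols complex matrix on the device.
    void transposeOnDevice(cufftComplex* dst, const cufftComplex* src, int rows, int cols);

    // 2-D FFT of an R x C complex matrix (row-major, already on the GPU) computed as
    // batched 1-D FFTs over rows, a sync point, then batched 1-D FFTs over columns.
    void fft2d(cufftComplex* d_data, cufftComplex* d_scratch, int R, int C)
    {
        cufftHandle rowPlan, colPlan;

        // Task 1: 1-D FFTs over all R rows (rows are contiguous in memory).
        cufftPlan1d(&rowPlan, C, CUFFT_C2C, R);
        cufftExecC2C(rowPlan, d_data, d_data, CUFFT_FORWARD);

        // Sync point: row results must be complete before the column task starts.

        // Task 2: transpose so columns become contiguous, then 1-D FFTs over them.
        transposeOnDevice(d_scratch, d_data, R, C);
        cufftPlan1d(&colPlan, R, CUFFT_C2C, C);
        cufftExecC2C(colPlan, d_scratch, d_scratch, CUFFT_FORWARD);
        transposeOnDevice(d_data, d_scratch, C, R);       // restore row-major layout

        cufftDestroy(rowPlan);
        cufftDestroy(colPlan);
    }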




Monte Carlo Stock Pricing

Monte Carlo stock pricing computes the expected future price of a set of stock options by considering variations along many possible paths, each of which is chosen randomly using a normal distribution. The stock price is assumed to follow a stochastic differential equation along each path, and the final price is the average of the expected prices over all paths. This process is repeated for a large number of independent stock options. We parallelize by assigning different stock options to different subtasks. We used a kernel from CUDA SDK version 2.3 for the GPU implementation of the application. On our GPU cluster, we achieve a speedup of 2517x over the serial implementation for 1024 options, considering variations along 2^26 paths. Once again, the high speedup is due to the GPU's suitability for such applications.
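
The per-path arithmetic behind such a kernel (the experiment itself uses the CUDA SDK kernel) is essentially geometric Brownian motion. The sketch below prices one European call option, with 'normals' holding pre-generated standard normal samples and the mean of the payoffs taken afterwards:

    // One thread per path: terminal price S_T = S0 * exp((r - sigma^2/2)T + sigma*sqrt(T)*Z),
    // discounted call payoff exp(-rT) * max(S_T - K, 0). The option price is the mean payoff.
    __global__ void mcCallPayoffs(const float* normals, float* payoffs, int numPaths,
                                  float S0, float K, float r, float sigma, float T)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= numPaths) return;

        float ST = S0 * expf((r - 0.5f * sigma * sigma) * T
                             + sigma * sqrtf(T) * normals[p]);
        payoffs[p] = expf(-r * T) * fmaxf(ST - K, 0.0f);
    }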




Page Rank

PageRank is an example of a data-analytic algorithm: it computes the search rank of a webpage based on the web's hyperlink structure. The search rank of a webpage is the probability of a random surfer visiting it. The algorithm first initializes the rank of every page to a constant value, and then iteratively transfers the ranks of all web pages to their outlinks until the ranks of all pages converge (or a maximum number of iterations is reached). PageRank is a popular application of the Map-Reduce paradigm. We generated random web graphs of different sizes and tested our PageRank implementation on them. We wrote a CUDA kernel for the PageRank algorithm and ported our serial implementation to PMLIB. Our CUDA implementation is relatively naive: the application has a large memory footprint and requires data sharing between CUDA threads, and we again use only the GPU's global memory, avoiding race conditions among the threads of a GPU subtask with slow global-memory atomic instructions. We implemented the iterations of the PageRank algorithm using the "callback rebounds" feature of our library, and used the reduce callback to add the contributions to a page's rank from all its inlinks (just as in Map-Reduce). We perform 100 iterations of this algorithm to arrive at the final PageRank values.
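
A global-memory-atomics iteration of the kind described might be sketched as follows (illustrative only, not our exact kernel; it assumes a CSR-like adjacency layout, a newRank array pre-initialized to the teleport term (1 - d)/numPages, and hardware support for floating-point atomicAdd):

    // One iteration: each thread owns a source page and pushes d * rank / out-degree
    // to each of its outlinks with a global-memory atomic add.
    // outEdges[outStart[p] .. outStart[p+1]) lists the outlinks of page p.
    __global__ void pageRankIteration(const int* outStart, const int* outEdges,
                                      const float* rank, float* newRank,
                                      int numPages, float damping)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= numPages) return;

        int begin = outStart[p], end = outStart[p + 1];
        int outDegree = end - begin;
        if (outDegree == 0) return;                   // dangling pages handled separately

        float share = damping * rank[p] / outDegree;
        for (int e = begin; e < end; ++e)
            atomicAdd(&newRank[outEdges[e]], share);  // slow global-memory atomic
    }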

Because PageRank is data-intensive, communication cost dominates. Our implementation's data compression feature (applied during transfer) significantly reduces this cost. As with FFT, PageRank computes faster on CPU cores than on CUDA devices; however, PageRank's computation-to-communication ratio is better than FFT's, so we do obtain performance improvements by parallelizing over multiple hosts. Our parallel implementation is 3.4x faster than the serial implementation for a web graph of 30 million pages. The bottleneck in this experiment is always the data transfer time, and a faster network would improve the results.




Data Compression

In this experiment, we perform in-memory data compression using zlib. A file is read into memory and PMLIB is used to compress different blocks of the file on different processing elements in parallel. We have not yet written (or otherwise obtained) a CUDA kernel for this experiment, so we only report results for parallelizing over CPUs. Using the 8 CPU cores of one host, our parallel program achieves a speedup of 7.63x over the serial implementation while compressing a 1 GB text file.
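
Each subtask's work reduces to compressing its own block independently, along these lines (a minimal sketch using zlib's one-shot API; the experiment's actual code and block-size choices may differ):

    #include <zlib.h>
    #include <vector>

    // Compresses one block of the in-memory file with zlib. Blocks share no state,
    // so different subtasks can each run this on their own block in parallel.
    std::vector<Bytef> compressBlock(const Bytef* block, uLong blockSize,
                                     int level = Z_DEFAULT_COMPRESSION)
    {
        uLongf destLen = compressBound(blockSize);    // worst-case compressed size
        std::vector<Bytef> out(destLen);
        if (compress2(out.data(), &destLen, block, blockSize, level) != Z_OK)
            out.clear();                              // signal failure to the caller
        else
            out.resize(destLen);                      // shrink to actual compressed size
        return out;
    }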




Array Summation

This experiment tests the performance of PMLIB on a linearly parallelizable problem. It repeatedly applies (25 times in our experiments) the following function, in place, to each element of an integer array and then finds the sum of all elements of the resulting array.

f(x) = floor( C * pow( sin(x), pow( cos(x), 1/3 ) ) )

Here, x is an element of the integer array and C is an integer constant.

To parallelize, each subtask applies the function f(x) in parallel to a different part of the input array and then computes the sum of its part. Finally, all subtasks add their results to obtain the total sum using the reduce callback. The parallel implementation using our library achieves a speedup of around 54x over the serial implementation for an array of 60 million elements when the computation is parallelized across GPU cores; when parallelizing only on CPUs, we obtain a 17x speedup. The high speedup with GPUs is primarily due to their high compute power: the GPU subtask uses thousands of simultaneous threads.
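
The per-subtask GPU work amounts to a kernel like the following (illustrative only; Cconst and reps stand for the constant C and the 25 repetitions), after which each subtask sums its slice with a standard parallel reduction and the reduce callback adds the per-subtask totals:

    // Applies f(x) = floor(C * pow(sin(x), pow(cos(x), 1/3))) to each element of this
    // subtask's slice, 'reps' times in place.
    __global__ void applyF(int* slice, int sliceLen, int reps, float Cconst)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= sliceLen) return;

        float x = (float)slice[i];
        for (int r = 0; r < reps; ++r)
            x = floorf(Cconst * powf(sinf(x), powf(cosf(x), 1.0f / 3.0f)));
        slice[i] = (int)x;
    }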





4. Future Work

We are working on enriching our model and library implementation with the following features:

Load Balancing

The processing elements in the compute cluster can be heterogeneous, each with different processing capabilities, and at times some of them may be overloaded with other work. Conditions on the various processing elements can change dynamically, and the programming model must adjust to the changing global environment. Load balancing becomes especially important in the context of CUDA devices, as there can be striking performance differences (depending on the computation) between different GPU cards and between CPU and GPU devices.

Determine Best Launch Configuration

Generally, on an 8-core shared-memory machine, having more than 8 threads gives better performance than having one thread per CPU core. In certain situations, the converse may be true (e.g., when the data being processed is too large to fit in memory and most of the time is spent thrashing). Similarly, in a distributed-memory setup there is a trade-off between processing power and communication cost as the number of machines increases. A similar trade-off exists for CUDA devices.

We need to develop a cost model that effectively weighs the cost of using an additional CPU thread or a heterogeneous processing device against the potential performance gain. The cost model must also be able to estimate the best launch configuration for different CUDA kernels (the organization of threads into blocks and blocks into grids).

Efficient use of Global Memory Hierarchy

We would like the model to automatically determine the best input/output memory partitions (and the extent of each partition) to assign to a particular processing element, with the goal of minimizing the overall execution time. From the cluster's perspective, the model needs to organize memory partitions optimally across heterogeneous hosts. From a host's perspective, it should do the same across local GPU cards and CPU cores. Finally, inside a GPU card the model must manage what is stored in its global memory and what is stored in its shared memory. A more aggressive optimization would also take the CPU's cache behavior into account. The overall goal is to distribute memory partitions optimally at all levels of the cluster, keeping the load balanced and maximizing computation throughput.