Forrest Iandola -- Bio and Example Application

Homework 0


Bio:

My name is Forrest Iandola, and I'm a 1st year PhD student in EECS. I have some background in parallel programming, distributed systems, and computer vision. One of my current research interests is accelerating and composing common computational kernels on parallel hardware.


Goals for this class:

In this class, I aim to build up a deep understanding of theoretical aspects of parallel computing. For example, I'm interested in learning more about lower bounds on communication and parallel computational complexity.



Application where parallel programming has been used:


Communication-Minimizing 2D Image Convolution in GPU Registers

This is a quick overview of some of my recent joint work with David Sheffield, Michael Anderson, Mangpo Phothilimthana, and Kurt Keutzer. This work uses NVIDIA Fermi and Kepler GPUs.

2D image convolution is ubiquitous in image processing and computer vision. In feature extraction algorithms such as Histogram of Oriented Gradients (HOG), and in other edge detection applications, convolving with small filters (e.g., 2x2 to 7x7) is common. Exploiting parallelism is a common strategy for accelerating convolution.

Parallel processors keep getting faster, but algorithms such as image convolution remain memory-bound on processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. To reduce memory communication, we reorganize the convolution algorithm to prefetch image regions into registers and to do more work per thread.

Our algorithm works as follows. Each thread begins by prefetching a region of the image from off-chip memory into its registers. For example, when using a 3x3 convolution filter, we might prefetch a 4x4 region of the image into registers. Then, using a user-provided convolution filter that we store in constant memory, we compute a region of output pixels. Finally, we write the output pixels back to off-chip memory. For the example with a 3x3 filter and a 4x4 region in registers, each thread would produce four output pixels (see the right side of the figure below).



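To make the register-blocking idea concrete, below is a minimal CUDA sketch of the 3x3-filter, 4x4-region case described above. It is an illustrative simplification rather than our tuned implementation: the kernel name, the launch configuration, and the assumption that the input image is padded so every 4x4 load stays in bounds are choices made only for this example.

#include <cuda_runtime.h>

// The 3x3 filter lives in constant memory, as described above.
__constant__ float c_filter[3][3];

// Each thread prefetches a 4x4 input region into registers and produces a
// 2x2 block of output pixels. The input is assumed to be padded so that
// every 4x4 load stays in bounds.
__global__ void conv3x3_reg(const float* __restrict__ in, float* __restrict__ out,
                            int in_width, int out_width, int out_height)
{
    // Top-left corner of this thread's 2x2 output block.
    int ox = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
    int oy = (blockIdx.y * blockDim.y + threadIdx.y) * 2;
    if (ox >= out_width || oy >= out_height) return;

    // Prefetch a 4x4 region from off-chip memory into registers. With the
    // fixed trip counts and #pragma unroll, the compiler keeps 'tile' in
    // registers rather than spilling it to local memory.
    float tile[4][4];
    #pragma unroll
    for (int i = 0; i < 4; ++i) {
        #pragma unroll
        for (int j = 0; j < 4; ++j) {
            tile[i][j] = in[(oy + i) * in_width + (ox + j)];
        }
    }

    // Apply the filter to compute the 2x2 output block entirely from
    // registers and constant memory, then write it back to off-chip memory.
    #pragma unroll
    for (int r = 0; r < 2; ++r) {
        #pragma unroll
        for (int c = 0; c < 2; ++c) {
            float acc = 0.0f;
            #pragma unroll
            for (int i = 0; i < 3; ++i) {
                #pragma unroll
                for (int j = 0; j < 3; ++j) {
                    acc += c_filter[i][j] * tile[r + i][c + j];
                }
            }
            out[(oy + r) * out_width + (ox + c)] = acc;
        }
    }
}

Because the loop trip counts are compile-time constants, the compiler can fully unroll the loops, which is what lets the tile array live in registers instead of local memory; the unrolled form is also a source of the extra instruction-level parallelism discussed below.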
Our algorithm provides the following advantages. Explicit prefetching increases the compiler's freedom to overlap communication and computation. Further, doing more work per thread reduces the total number of memory accesses, exploits the L1 cache on instruction replays, and increases instruction-level parallelism.

Now that we have discussed how and why our algorithm reduces the time spent in memory communication, we turn to the problem of sweeping the design space of implementation choices. We implement an autotuner that evaluates both global memory and the texture cache for off-chip storage. We also evaluate the impact of first fetching an image region into shared memory, then distributing the pixels from shared memory to registers. In addition, our autotuner sweeps the design space of thread granularities. Depending on the filter size, our autotuner's speedups on NVIDIA Fermi and Kepler range from 1.2x to 4.5x over state-of-the-art GPU libraries. Moreover, we achieve the memory bound for 2x2 and 3x3 filters on Kepler, and we come within 2x of the memory bound for these filter sizes on Fermi. The following plot compares our communication-minimizing results with several other libraries on an NVIDIA Kepler GTX680.



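As a rough illustration of what one slice of this design-space sweep looks like on the host side, the following toy CUDA snippet times a handful of thread-block shapes for the conv3x3_reg kernel sketched above and keeps the fastest one. The candidate shapes, and the restriction to block shape alone, are assumptions made only for this example; the actual autotuner also varies the per-thread output granularity, the off-chip path (global memory vs. the texture cache), and whether a shared-memory staging step is used.

#include <cstdio>
#include <cuda_runtime.h>

// Toy sweep over thread-block shapes for the conv3x3_reg kernel defined in
// the earlier sketch. Each candidate is timed with CUDA events and the
// fastest configuration is reported.
static void tune_block_shape(const float* d_in, float* d_out,
                             int in_width, int out_width, int out_height)
{
    const dim3 candidates[] = { dim3(32, 4), dim3(32, 8), dim3(16, 16), dim3(64, 4) };
    float best_ms = 1e30f;
    dim3 best_block(0, 0);

    for (const dim3& block : candidates) {
        // Each thread produces a 2x2 output block, so halve the pixel counts.
        dim3 grid((out_width / 2 + block.x - 1) / block.x,
                  (out_height / 2 + block.y - 1) / block.y);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        conv3x3_reg<<<grid, block>>>(d_in, d_out, in_width, out_width, out_height);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);

        if (ms < best_ms) {
            best_ms = ms;
            best_block = block;
        }
    }

    printf("best block shape: %u x %u (%.3f ms)\n", best_block.x, best_block.y, best_ms);
}

In practice one would also warm up the kernel and average over several runs before trusting the timings, but the structure of the search (enumerate configurations, time each, keep the best) is the same.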
Some details: