Accelerating the 3-2 Pathtracer with NVIDIA CUDA

George Lee, Daniel Kessler, Abizer Lokhandwala

CS184 Spring 2019

Abstract

Graphics Processing Units employ a specialized, highly parallel architecture that significantly accelerates certain common tasks in generating and rendering realistic 3D graphics. Raytracing can take advantage of this massive parallelism, yielding roughly 100x speedups in rendering performance compared to a CPU-only approach. This report documents our efforts and results in accelerating the Project 3-2 Pathtracer using the NVIDIA CUDA GPU parallel computing API.

Introduction

Raytracing is a technique for photorealistic rendering: rays are ‘cast’ into a 3D computer model of a scene, and the intersections of each ray with the objects positioned in the scene are analyzed. With a model of the properties of the intersected objects, and by following some basic laws of physics, we can approximately solve the rendering equation, producing an image that in many cases is very close to ‘correct’. However, raytracing is an expensive process: every sample of every pixel in the camera’s viewing frame may require multiple rays to be cast, each of which may reflect or refract off objects in the scene, repeatedly, until it is terminated. Even for a small scene, this can add up to millions of rays, which is computationally taxing, especially on general-purpose consumer CPUs, which are not designed or optimized for this use case.

Fortunately, the problem is highly parallelizable and lends itself well to execution on GPUs, which are highly parallel by nature. The problem is parallelizable because each ray’s interactions with the scene are self-contained: as a ray bounces through the scene, it picks up radiance from the objects it interacts with independently of any other ray or object, so rays require no coordination with one another to arrive at a solution. In high-performance computing terms, the problem is ‘embarrassingly parallel’ - not only can multiple rays be cast at the same time, but almost every other part of a render job can also be done in parallel, or even on entirely different machines.
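For reference, the rendering equation that path tracing approximates can be written in its standard textbook form (the notation here is the conventional one, not anything specific to our code):

    L_o(p, \omega_o) = L_e(p, \omega_o) + \int_{\mathcal{H}^2} f_r(p, \omega_i, \omega_o) \, L_i(p, \omega_i) \cos\theta_i \, d\omega_i

where L_o is the outgoing radiance at point p in direction ω_o, L_e is the emitted radiance, f_r is the BRDF, and the integral accumulates incoming radiance L_i over the hemisphere of incoming directions ω_i. Each sample ray is one Monte Carlo estimate of this integral, which is why samples can be computed independently of one another.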

Technical Approach

While conceptually easy to reason about, actually writing parallel code is not that simple, for several reasons. First, as an individual programmer, it is nontrivial to reason about the parallel execution units found in accelerator hardware and how exactly to dispatch work to those units. On top of this, every vendor of parallel hardware has a different microarchitecture, which makes naively writing portable parallel code nearly impossible. To address this, many vendors have come together to create common APIs with vendor-specific translation layers, such as OpenCL. Some vendors, however, continue to maintain proprietary, hardware-specific acceleration APIs - NVIDIA, for example, with its CUDA parallel computing API.

In our case, we started with the raytracer written for CS184 Project 3, which uses CPU thread parallelism to modestly accelerate the rendering process. While better than nothing, this is still quite inefficient, since CPU threads are not well suited to this workload. We then migrated the ray tracing portion of the project to the CUDA framework, running it on hardware that is highly optimized for this kind of work, which resulted in roughly 100x speedups in rendering performance (measured as the reduction in rendering time). For example, renders that previously took over 500 seconds on the CPU completed in less than 2 seconds on the GPU.

Process

A significant portion of the work in this project was properly setting up the environment needed to compile and run CUDA code. In particular, we targeted Debian 9, and the setup took a nontrivial amount of time. NVIDIA officially supports Ubuntu, a Linux distribution based on Debian. Because of this, we hypothesized that the officially supported packages built for Ubuntu would also work on Debian, since Ubuntu bases a significant portion of its package repository on Debian. The core library needed to compile CUDA code is nvidia-cuda-toolkit, which is packaged with a specific version of the NVIDIA graphics drivers needed to run the resulting code.

This hypothesis, while ultimately correct, still turned out to be naive. We ran into significant, and still unexplained, issues just trying to get the necessary dependencies to compile successfully, let alone integrate into our codebase. In particular, after installing libglfw3, a dependency of the Project 3 skeleton, code that had been running under a working pre-existing installation of nvidia-cuda-toolkit would suddenly become unstable and often end up in an uninterruptible sleep state on the machines we were testing this project on - specifically, the high-performance computing cluster of the Open Computing Facility. On two occasions, recovering from this required a hard restart of the affected machines in the cluster. We have not been able to determine what exactly caused this instability, as the code that failed belonged to other members of the community who were running their own projects on the HPC cluster.

At one point, after many hours of debugging, we were able to successfully compile one set of examples from the NVIDIA OptiX SDK; however, when we ran them on a machine with an NVIDIA GPU, the examples ran at abysmal frame rates - 4 fps or less. Though the program recognized the existence of a compatible GPU (a GTX 1060), we do not know why the provided examples were unable to take advantage of it; we believe the low frame rates were the result of the framework falling back to emulating the hardware in software. Ultimately, we abandoned our attempt to use OptiX and decided to use CUDA directly, because we were able to successfully compile and run example CUDA code and verify that it took full advantage of the system’s GPU.

After we confirmed that CUDA worked, we began porting our raytracer to it. Among the CUDA examples were the major parts of a working pathtracer, which we used as the basis for our GPU-accelerated pathtracer. The majority of our changes were made to integrate the upstream skeleton with the application, scene-loading, and BVH code from our project. This required modifying the upstream code to accept the data structures from our project, namely to interact with our BVH structure correctly. While it would be possible to migrate BVH construction to the GPU as well, for the purposes of this project we continued to build the BVH on the CPU before init()ing the GPU and cudaMemcpy()ing the resulting data structure to GPU memory. Other variables are set as members of the CUDAPathTracer C++ class, which is instantiated on the GPU side as well. Finally, to execute the parallel ray tracing itself, we call traceScene<<<gridDim, blockDim>>>, which runs the raytracing in parallel, one pixel per thread, across all available GPU execution units.
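To make that flow concrete, below is a minimal sketch of the host-side setup and kernel launch described above. The type and function names (BVHNode, traceScene, renderOnGPU, the framebuffer layout) are illustrative placeholders, not the exact signatures from our code.

    #include <cuda_runtime.h>

    struct BVHNode { /* flattened BVH node, built on the CPU as before */ };

    // One thread per pixel; each thread traces its rays independently of the others.
    __global__ void traceScene(const BVHNode* bvh, float3* framebuffer,
                               int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;
        // ... generate the camera ray for (x, y), traverse bvh, accumulate radiance ...
        framebuffer[y * width + x] = make_float3(0.f, 0.f, 0.f);
    }

    void renderOnGPU(const BVHNode* hostBVH, size_t bvhBytes, int width, int height) {
        // Copy the CPU-built BVH to device memory and allocate the output framebuffer.
        BVHNode* dBVH = nullptr;
        float3* dFrame = nullptr;
        cudaMalloc(&dBVH, bvhBytes);
        cudaMalloc(&dFrame, width * height * sizeof(float3));
        cudaMemcpy(dBVH, hostBVH, bvhBytes, cudaMemcpyHostToDevice);

        // Launch one thread per pixel: each block covers a 16x16 tile of the image;
        // these dimensions play the role of the blockDim and gridDim mentioned above.
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        traceScene<<<grid, block>>>(dBVH, dFrame, width, height);
        cudaDeviceSynchronize();

        // Copy the finished framebuffer back to the host (omitted), then clean up.
        cudaFree(dBVH);
        cudaFree(dFrame);
    }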

Results

For all renders, we achieved a speedup of at least 100x. Rendering times increased linearly with almost all parameters:


For maximum ray depth, rendering times did not increase linearly, and in fact plateaued at an earlier point for the GPU renders than for the CPU renders. This is likely because the actual depth of each path is random (rays can terminate early, e.g. via Russian roulette), so the effect of raising the maximum depth trails off naturally.


When modifying the number of samples per area light, rendering times increased linearly for both CPU and GPU renders.

Similarly, rendering times increased linearly with samples per pixel.

The following images were rendered with these settings: 100 light rays, a maximum of 6 ray bounces, and 8192 samples per pixel.

Lessons Learned

We learned a number of lessons from working on this project, in particular about dealing with a less-than-ideal operations environment. All of us have significant experience working with Linux, and indeed use Linux as our primary operating system, but part of being a Linux user is being accustomed to suboptimal graphics performance. Although drivers are available for Linux, none of us had any particular experience wrangling GPUs on Linux in the explicit context of rendering graphics. We came into this thinking that, like many other things on Linux, getting the project to compile would be as simple as apt install optix or apt install cuda. We were sorely mistaken. On Debian, the graphics drivers available in the repositories were often significantly out of date, and installing newer drivers required installing new underlying C and C++ libraries that conflicted with others already on the system. Trying to install everything in a container also did not work: this would have required passing the GPU into the container, which would have meant the host could not use it, preventing us from directly viewing the results of our renders - not to mention destabilizing the entire system, since the PCI-passthrough code in the kernel that ships with Debian 9 is not particularly stable.

Basically, we learned a lot about debugging packaging and code execution problems on Linux, more so than we learned about writing parallel graphics code itself.

Directions For Future Work

In the future, we hope to parallelize the construction of the BVH using the Z-order curve, as described on the NVIDIA developer blog. This method of constructing the BVH further improves the ray tracer’s intersection lookup times by increasing cache hit rates, thanks to the spatial locality induced by the space-filling nature of the curve.
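The key first step in that approach is assigning each primitive a Morton code, so that sorting primitives by code lays them out along the Z-order curve. The sketch below follows the expand-and-interleave bit trick from that blog post; integrating it with our BVH builder (sorting the codes and emitting the node hierarchy in parallel) would be the remaining work.

    // Expands a 10-bit integer into 30 bits by inserting two zeros after each bit.
    __device__ unsigned int expandBits(unsigned int v) {
        v = (v * 0x00010001u) & 0xFF0000FFu;
        v = (v * 0x00000101u) & 0x0F00F00Fu;
        v = (v * 0x00000011u) & 0xC30C30C3u;
        v = (v * 0x00000041u) & 0x49249249u;
        return v;
    }

    // Computes a 30-bit Morton code for a point inside the unit cube [0, 1]^3,
    // e.g. the normalized centroid of a primitive's bounding box.
    __device__ unsigned int morton3D(float x, float y, float z) {
        x = fminf(fmaxf(x * 1024.0f, 0.0f), 1023.0f);
        y = fminf(fmaxf(y * 1024.0f, 0.0f), 1023.0f);
        z = fminf(fmaxf(z * 1024.0f, 0.0f), 1023.0f);
        unsigned int xx = expandBits((unsigned int)x);
        unsigned int yy = expandBits((unsigned int)y);
        unsigned int zz = expandBits((unsigned int)z);
        return xx * 4 + yy * 2 + zz;
    }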

We could also fix bugs in the sampling code that caused it to fail on models with microfacet materials. Our sampling code also appeared to produce images with slightly lower quality than CPU renders, given the same parameters. This could be fixed through upsampling (and we would still be achieving significant performance gains), but it would be worthwhile to investigate why the code was producing noisy renders.

Work Done By Each Team Member

This project was done entirely in collaboration. No part of it was done in isolation or independently, with the exception of actually running some of the renders, which could be done in parallel. We met multiple times for hours on end and pair/trio-programmed while debugging and compiling the project.