PARALLEL COMPUTING EXPERIENCES WITH CUDA


Michael Garland, Scott Le Grand, John Nickolls
NVIDIA
Joshua Anderson
Iowa State University and Ames Laboratory
Jim Hardwick
TechniScan Medical Systems
Scott Morton
Hess
Everett Phillips, Yao Zhang
University of California, Davis
Vasily Volkov
University of California, Berkeley

Source: http://www2.computer.org/cms/Computer.org/ComputingNow/homepage/0908/ParallelComputingExperienceswithCUDA.pdf

Introduction

The CUDA programming model provides a straightforward means of describing inherently parallel computations, and NVIDIA’s Tesla GPU architecture delivers high computational throughput on massively parallel problems. This article surveys experiences gained in applying CUDA to a diverse set of problems, and the parallel speedups over sequential codes running on traditional CPU architectures that were attained by executing key computations on the GPU.

With the transition from single-core to multicore processors essentially complete, virtually all commodity CPUs are now parallel processors. Increasing parallelism, rather than increasing clock rate, has become the primary engine of processor performance growth, and this trend is likely to continue. This raises many important questions about how to productively develop efficient parallel programs that will scale well across increasingly parallel processors.

Modern graphics processing units (GPUs) have been at the leading edge of increasing chip-level parallelism for some time. Current NVIDIA GPUs are manycore processor chips, scaling from 8 to 240 cores. This degree of hardware parallelism reflects the fact that GPU architectures evolved to fit the needs of real-time computer graphics, a problem domain with tremendous inherent parallelism. With the advent of the GeForce 8800, the first GPU based on NVIDIA’s Tesla unified architecture, it has become possible to program GPUs directly, as massively parallel processors rather than simply as graphics API accelerators.

NVIDIA developed the CUDA programming model and software environment to let programmers write scalable parallel programs using a straightforward extension of the C language. The CUDA programming model guides the programmer to expose substantial fine-grained parallelism sufficient for utilizing massively multithreaded GPUs, while at the same time providing scalability across the broad spectrum of physical parallelism available in the range of GPU devices. Because it provides a fairly simple, minimalist abstraction of parallelism and inherits all the well-known semantics of C, it lets programmers develop massively parallel programs with relative ease.
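To make the model concrete, the following is a minimal sketch in plain C99 that emulates, sequentially on the CPU, how a CUDA kernel launch decomposes a data-parallel loop (SAXPY, y = a*x + y) into a grid of independent thread blocks. It is an illustration under stated assumptions, not CUDA code and not code from this article: the names thread_ctx, saxpy_thread, launch_saxpy, and blocks_for are hypothetical, and only the x dimension of CUDA's indexing is modeled. In real CUDA C the per-thread function would be declared __global__, would read the built-in variables blockIdx.x, blockDim.x, and threadIdx.x, and would be launched with syntax such as saxpy<<<num_blocks, 256>>>(n, a, d_x, d_y).

```c
/* Hypothetical stand-in for CUDA's built-in index variables
   (blockIdx.x, blockDim.x, threadIdx.x); only the x dimension is modeled. */
typedef struct {
    int block_idx;   /* which thread block this thread belongs to */
    int block_dim;   /* number of threads per block */
    int thread_idx;  /* this thread's index within its block */
} thread_ctx;

/* Per-thread body of a SAXPY kernel (y = a*x + y): ordinary C, plus a
   computed global index. In CUDA C this function would be __global__. */
static void saxpy_thread(thread_ctx t, int n, float a, const float *x, float *y)
{
    int i = t.block_idx * t.block_dim + t.thread_idx; /* global element index */
    if (i < n)                 /* the last block may be only partly used */
        y[i] = a * x[i] + y[i];
}

/* Sequential emulation of a launch like saxpy<<<num_blocks, block_dim>>>(...).
   On a GPU the blocks run concurrently in no guaranteed order; here they run
   one after another, which is valid because no thread reads another's output. */
static void launch_saxpy(int num_blocks, int block_dim,
                         int n, float a, const float *x, float *y)
{
    for (int b = 0; b < num_blocks; ++b) {
        for (int t = 0; t < block_dim; ++t) {
            thread_ctx ctx = { b, block_dim, t };
            saxpy_thread(ctx, n, a, x, y);
        }
    }
}

/* Number of blocks needed to cover n elements: ceil(n / block_dim). */
static int blocks_for(int n, int block_dim)
{
    return (n + block_dim - 1) / block_dim;
}
```

For example, with n = 1000 and 256 threads per block, blocks_for returns 4, so the emulated grid contains 1,024 threads and the i < n guard leaves the last 24 idle. Because the blocks share no data, they may execute in any order or all at once; this independence is what allows the same grid to run unchanged on a GPU with 8 cores or one with 240, the larger chip simply executing more blocks at a time.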

In the year since its release, many developers have used CUDA to parallelize and accelerate computations across various problem domains. In this article, we survey some experiences gained in applying CUDA to a diverse set of problems and the parallel speedups attained by executing key computations on the GPU.