PARALLEL COMPUTING EXPERIENCES WITH CUDA
Michael Garland, Scott Le Grand, and John Nickolls, NVIDIA
Joshua Anderson
Jim Hardwick, TechniScan Medical Systems
Scott Morton, Hess
Everett Phillips
Vasily Volkov
Introduction
THE CUDA PROGRAMMING MODEL PROVIDES A STRAIGHTFORWARD MEANS OF DESCRIBING INHERENTLY PARALLEL COMPUTATIONS, AND NVIDIA'S TESLA GPU ARCHITECTURE DELIVERS HIGH COMPUTATIONAL THROUGHPUT.
With the transition from single-core to multicore processors essentially complete,
virtually all commodity CPUs are now parallel processors. Increasing
parallelism, rather than increasing clock rate, has become the primary engine
of processor performance growth, and this trend is likely to continue. This
raises many important questions about how to productively develop efficient
parallel programs that will scale well across increasingly parallel processors.
Modern
graphics processing units (GPUs) have been at the
leading edge of increasing chip-level parallelism for some time. Current NVIDIA
GPUs are manycore processor
chips, scaling from 8 to 240 cores. This degree of hardware parallelism
reflects the fact that GPU architectures evolved to fit the needs of real-time
computer graphics, a problem domain with tremendous inherent parallelism. With
the advent of the GeForce 8800—the first GPU based on
NVIDIA’s Tesla unified architecture — it has become possible to program GPU
processors directly, as massively parallel processors rather than simply as
graphics API accelerators.
NVIDIA developed the CUDA programming model and
software environment to let programmers write scalable parallel programs using a
straightforward extension of the C language. The CUDA programming model guides
the programmer to expose substantial fine-grained parallelism sufficient for
utilizing massively multithreaded GPUs, while at the
same time providing scalability across the broad spectrum of physical
parallelism available in the range of GPU devices. Because it provides a fairly
simple, minimalist abstraction of parallelism and inherits all the well-known
semantics of C, it lets programmers develop massively parallel programs with
relative ease.
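As a brief illustration of this minimal extension of C, a kernel along the following lines expresses fine-grained parallelism by assigning one array element to each thread. This is a sketch, not code from the article; the SAXPY example, its names, and its launch configuration are common CUDA idioms chosen here for illustration.

```cuda
// Sketch of a CUDA kernel: SAXPY (y = a*x + y), one element per thread.
// The kernel name and launch parameters are illustrative assumptions.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // Each thread computes its own global index from its block and
    // thread coordinates.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // guard: the grid may overshoot n
        y[i] = a * x[i] + y[i];
}

// Host-side launch: enough 256-thread blocks to cover all n elements.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

The guard and the index arithmetic are the only additions beyond ordinary C, which is what lets a single kernel scale across GPUs with widely different numbers of cores.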
In the year since its release, many developers have
used CUDA to parallelize and accelerate computations across various problem
domains. In this article, we survey some experiences gained in applying CUDA to
a diverse set of problems and the parallel speedups attained by executing key
computations on the GPU.