Hardware acceleration vs. algorithmic acceleration:
Can GPU-based processing beat complexity optimization
for CT?
Neophytos Neophytou, Fang Xu, Klaus Mueller
Center for Visual Computing, Computer Science Department, Stony Brook University, Stony Brook, NY
Source: http://www.cs.sunysb.edu/~mueller/papers/SPIE-MedImg07.pdf
ABSTRACT
Three-dimensional computed tomography (CT) is a
compute-intensive process, due to the large amounts of source and destination
data, and this limits the speed at which a reconstruction can be obtained.
There are two main approaches to cope with this problem: (i)
lowering the overall computational complexity via algorithmic means, and/or
(ii) running CT on specialized high-performance hardware. Since the latter
requires considerable capital investment into rather inflexible hardware, the
former option is typically all that is available in a traditional CPU-based
computing environment. However, the emergence of programmable commodity
graphics hardware (GPUs) has changed this situation
in a decisive way. In this paper, we show that GPUs
represent a commodity high-performance parallel architecture that resonates
very well with the computational structure and operations inherent to CT. Using
formal arguments as well as experiments we demonstrate that GPU-based
‘brute-force’ CT (i.e., CT at regular complexity) can be significantly faster
than CPU-based as well as GPU-based CT with optimal complexity, at least for
practical data sizes. Therefore, the answer to the title question, “Can
GPU-based processing beat complexity optimization for CT?”, is “Absolutely!”
Keywords: computed tomography, CT, 3D
reconstruction, filtered back projection, inverse Radon Transform, programmable
graphics hardware, GPU
INTRODUCTION
This paper builds on our previous work6,9, which
devised various techniques for the GPU-acceleration of a rich set of CT
algorithms, including analytical (Feldkamp FDK3) as
well as iterative (SART, OS-EM) methods, the latter with attenuation correction
and detector blurring for emission tomography. In the current work, our focus
is on contrasting the speedups that can be gained by GPU-acceleration of the
standard backprojection operations with those that
can be gained by algorithmic acceleration aimed at reducing and optimizing the
theoretical computational complexity. This is a new contribution, which will
help theoreticians and practitioners alike to assess the power of commodity hardware
acceleration for practical applications. Our research indicates that
programmable commodity graphics hardware (GPU) offers such tremendous
arithmetic acceleration capabilities for standard CT that it is able to outpace
any complexity reducing algorithm, both for CPU and for GPU implementations of
these. In this study, in order to keep the comparison fair, we only compare
single-CPU with single-GPU solutions, noting that both platforms scale equally
well in terms of the number of processors (at least for the task of backprojection).
The computational complexity for reconstructing an N³
dataset using standard backprojection of πN/2 single-orbit
projections of size N² is O(N⁴). This
complexity can be reduced to O(N³ log N) if one performs a polar-to-Cartesian
regridding in Fourier space, followed by an
inverse FFT. Algorithmically speaking, this FFT-based approach is a divide and
conquer (d&c) acceleration. Computationally more
involved is Grangeat’s groundbreaking mathematical
framework4 for exact computed tomography (CT) from cone-beam data, using the
first derivative of the Radon Transform. Although inherently of complexity O(N⁵),
by using the decomposition due to Marr, the final complexity could be
lowered to O(N⁴). Since then, various researchers1,2 have proposed to
improve this complexity even further, also to O(N³ log N),
by exploiting the FFT or other d&c schemes for the backprojection (as well as for the
filtering). However, the non-FFT d&c methods in particular incur
substantial overhead for data management, which can lower the overall impact
that the complexity reduction has on computational speed.
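As a back-of-the-envelope sketch of where these figures come from (assuming πN/2 views of N² detector pixels each, an N³ volume, O(1) work per voxel and view, and a constant-size regridding kernel), the two routes compare as

\[
C_{\mathrm{backprojection}} \;=\; \frac{\pi N}{2} \cdot N^{3} \cdot O(1) \;=\; O(N^{4}),
\]
\[
C_{\mathrm{Fourier}} \;=\; \frac{\pi N}{2} \cdot O(N^{2}\log N) \;+\; O(N^{3}) \;+\; O(N^{3}\log N) \;=\; O(N^{3}\log N),
\]

where the three Fourier-domain terms account for the 2D FFTs of the views, the polar-to-Cartesian regridding, and the inverse 3D FFT, respectively.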
All of these complexity-reduction approaches have
mainly been implemented on CPUs (although DSPs could
be readily used for the FFTs). But let us assume that
all one has is a high-end commodity PC (as is standard in any research lab),
equipped with a matching GPU board, such as an NVidia
GeForce 8800 GTX (which is available for less than
$500 in any computer outlet). The GeForce 8800 GTX
offers 128 parallel stream processors at IEEE floating point precision (but
note that the number of pipelines keeps increasing at 6-month intervals). Thus,
the maximal degree of parallelism, given proper programming, is 128 (for the GeForce 8800 GTX). For now, let us further assume, for
simplicity, that each pipeline is as fast as a single-core CPU, which is
inherently a single-pipeline architecture (unless MMX or SSE are employed). The
speedup afforded by algorithmic (complexity) optimization is N/log N, which is 56
for N=512 and 102 for N=1024 (larger N are currently
infrequent). Thus, by these admittedly simple arguments alone, GPU
acceleration provides potential speedup factors better than or equivalent to
those of algorithmic optimization. But now let us take this argument further and
consider the arithmetic pipelines themselves (before we just assumed that both
pipelines had equal speed). The main operations in backprojection
are 3D (voxel) projections, 2D (image)
interpolations, and scalar (voxel) value additions.
These are in fact the core operations in computer graphics algorithms, and they
are thus well optimized on graphics boards, which were developed to generate
high-quality real-time 3D imagery for computer games. In contrast, for
general-purpose CPUs these operations form just another (completely
un-optimized) sequence of instructions. Therefore, the actual
acceleration factor is likely even higher than the 128 estimated above, and the results
section of this paper offers measured data to support this point.
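To make this mapping concrete, the following is a minimal, illustrative CUDA sketch of one voxel-driven backprojection pass; it is not the shader-based implementation evaluated in this paper, and all names (P, bilinear, backproject_one_view) are placeholders. It spells out the three core operations named above: the 3D voxel projection, the 2D bilinear image interpolation (written out by hand here only to keep the sketch self-contained; on the GPU it corresponds to a single hardware-filtered texture fetch), and the scalar accumulation into the voxel.

#include <cuda_runtime.h>

// 3x4 projection matrix of the current view (uploaded once per projection).
__constant__ float P[12];

// Clamp-to-edge bilinear lookup (the operation that texture units perform in hardware).
__device__ float bilinear(const float* img, int W, int H, float u, float v)
{
    u = fminf(fmaxf(u, 0.0f), W - 1.001f);
    v = fminf(fmaxf(v, 0.0f), H - 1.001f);
    int   u0 = (int)u,  v0 = (int)v;
    float fu = u - u0,  fv = v - v0;
    float a = img[ v0      * W + u0], b = img[ v0      * W + u0 + 1];
    float c = img[(v0 + 1) * W + u0], d = img[(v0 + 1) * W + u0 + 1];
    return (1.0f - fv) * ((1.0f - fu) * a + fu * b)
         +         fv  * ((1.0f - fu) * c + fu * d);
}

// Accumulates one projection (proj, W x H) into the N^3 volume (vol).
__global__ void backproject_one_view(float* vol, const float* proj, int N, int W, int H)
{
    // One thread per (x, y) column of voxels; the z-loop reuses the matrix rows.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= N || y >= N) return;

    for (int z = 0; z < N; ++z) {
        // 3D voxel projection: (u, v, w) = P * (x, y, z, 1).
        float u = P[0] * x + P[1] * y + P[2]  * z + P[3];
        float v = P[4] * x + P[5] * y + P[6]  * z + P[7];
        float w = P[8] * x + P[9] * y + P[10] * z + P[11];
        float iw = 1.0f / w;                               // perspective divide
        float s  = bilinear(proj, W, H, u * iw, v * iw);   // 2D image interpolation
        vol[((size_t)z * N + y) * N + x] += s * iw * iw;   // scalar add (FDK-style 1/w^2 weight)
    }
}

Each thread owns one column of voxels, so the accumulation requires no synchronization; a new projection matrix and projection image are supplied for every view.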
We have shown in our previous work that the simple
streaming data model of a backprojection accelerates
very well on GPUs. While a divide & conquer
algorithm can also be implemented, it requires more elaborate data management,
which slows down the data flow and processing. Given the complexity of the
hardware, contrasting these two configurations is difficult using theoretical
arguments of the type employed above. We therefore base our analysis
on measured results.
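To illustrate what is meant by the streaming data model, a hypothetical host-side driver for the kernel sketched above (assuming the kernel and its __constant__ matrix P reside in the same compilation unit) could look as follows: each pre-filtered projection is uploaded and immediately accumulated into the volume, with no inter-projection data structures or reordering; this is precisely the data-management burden that the d&c variants add.

#include <cuda_runtime.h>

void reconstruct(float* d_vol,           // device volume, N*N*N floats, zero-initialized
                 const float* h_projs,   // host projections, numViews * W * H floats
                 const float* h_mats,    // host 3x4 matrices, numViews * 12 floats
                 int numViews, int N, int W, int H)
{
    float* d_proj = nullptr;
    cudaMalloc(&d_proj, (size_t)W * H * sizeof(float));

    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);

    for (int i = 0; i < numViews; ++i) {
        // Stream in the next projection image and its projection matrix.
        cudaMemcpy(d_proj, h_projs + (size_t)i * W * H,
                   (size_t)W * H * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpyToSymbol(P, h_mats + 12 * i, 12 * sizeof(float));

        // Accumulate this view into the volume.
        backproject_one_view<<<grid, block>>>(d_vol, d_proj, N, W, H);
    }
    cudaDeviceSynchronize();
    cudaFree(d_proj);
}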