Hardware acceleration vs. algorithmic acceleration:

Can GPU-based processing beat complexity optimization for CT?

Neophytos Neophytou, Fang Xu, Klaus Mueller

Center for Visual Computing, Computer Science Department, Stony Brook University

Stony Brook, NY, USA 11794-4400

 

Source: http://www.cs.sunysb.edu/~mueller/papers/SPIE-MedImg07.pdf

 

 

ABSTRACT

 

Three-dimensional computed tomography (CT) is a compute-intensive process, due to the large amounts of source and destination data, and this limits the speed at which a reconstruction can be obtained. There are two main approaches to cope with this problem: (i) lowering the overall computational complexity via algorithmic means, and/or (ii) running CT on specialized high-performance hardware. Since the latter requires considerable capital investment in rather inflexible hardware, the former option is typically all that is available in a traditional CPU-based computing environment. However, the emergence of programmable commodity graphics hardware (GPUs) has changed this situation in a decisive way. In this paper, we show that GPUs represent a commodity high-performance parallel architecture that resonates very well with the computational structure and operations inherent to CT. Using formal arguments as well as experiments, we demonstrate that GPU-based ‘brute-force’ CT (i.e., CT at regular complexity) can be significantly faster than CPU-based as well as GPU-based CT at optimal complexity, at least for practical data sizes. Therefore, the answer to the title question, “Can GPU-based processing beat complexity optimization for CT?”, is “Absolutely!”

 

Keywords: computed tomography, CT, 3D reconstruction, filtered back projection, inverse Radon Transform, programmable graphics hardware, GPU

 

INTRODUCTION

 

This paper builds on our previous work [6, 9], which devised various techniques for the GPU-acceleration of a rich set of CT algorithms, including analytical (Feldkamp FDK [3]) as well as iterative (SART, OS-EM) methods, the latter with attenuation correction and detector blurring for emission tomography. In the current work, our focus is on contrasting the speedups that can be gained by GPU-acceleration of the standard backprojection operations with those that can be gained by algorithmic acceleration aimed at reducing the theoretical computational complexity. This is a new contribution, which will help theoreticians and practitioners alike to assess the power of commodity hardware acceleration for practical applications. Our research indicates that programmable commodity graphics hardware (GPUs) offers such tremendous arithmetic acceleration capabilities for standard CT that it is able to outpace any complexity-reducing algorithm, whether the latter is implemented on the CPU or on the GPU. To keep the comparison fair, we compare only single-CPU with single-GPU solutions, noting that both platforms scale equally well with the number of processors (at least for the task of backprojection).

The computational complexity of reconstructing an N³ dataset using standard backprojection of πN/2 single-orbit projections of size N² is O(N⁴). This complexity can be reduced to O(N³ log N) if one performs a polar-to-Cartesian regridding in Fourier space, followed by an inverse FFT. Algorithmically speaking, this FFT-based approach is a divide-and-conquer (d&c) acceleration. Computationally more involved is Grangeat’s groundbreaking mathematical framework [4] for exact computed tomography (CT) from cone-beam data, which uses the first derivative of the Radon transform. Although inherently of complexity O(N⁵), its final complexity can be lowered to O(N⁴) by means of Marr’s decomposition. Since then, various researchers [1, 2] have proposed to improve this complexity even further, also to O(N³ log N), by exploiting the FFT or other d&c schemes for backprojection (as well as for filtering). However, the non-FFT d&c methods in particular incur substantial overhead for data management, which can lower the overall impact that the complexity reduction has on computational speed.
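To make these two complexity figures concrete, the operation counts behind them can be tallied as follows. This is only a back-of-the-envelope accounting (in LaTeX notation) of the quantities stated above, not a derivation from the cited papers:

% Brute-force backprojection: each of the \pi N/2 projections updates all
% N^3 voxels at O(1) cost per update (project, interpolate, accumulate):
W_{\mathrm{bp}} = \frac{\pi N}{2} \cdot N^{3} \cdot O(1) = O(N^{4})

% Fourier route: 2D FFTs of the O(N) projections, the polar-to-Cartesian
% regridding, and one 3D inverse FFT:
W_{\mathrm{fft}} = O(N \cdot N^{2}\log N) + O(N^{3}) + O(N^{3}\log N) = O(N^{3}\log N)

% Hence the best-case algorithmic speedup is
W_{\mathrm{bp}} / W_{\mathrm{fft}} = O(N / \log N)

This N/log N ratio is the quantity that the GPU's raw parallelism has to beat, which is the comparison taken up next.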

All of these complexity-reduction approaches have mainly been implemented on CPUs (although DSPs could readily be used for the FFTs). But let us assume that all one has is a high-end commodity PC (as is standard in any research lab), equipped with a matching GPU board, such as an NVidia GeForce 8800 GTX (available for less than $500 in any computer outlet). The GeForce 8800 GTX offers 128 parallel stream processors with IEEE floating-point precision (and note that the number of pipelines keeps increasing at 6-month intervals). Thus, the maximal degree of parallelism, given proper programming, is 128 for this board. For now, let us further assume, for simplicity, that each pipeline is as fast as a single-core CPU, which is inherently a single-pipeline architecture (unless MMX or SSE are employed). The speedup afforded by the algorithmic optimization is N/log N, which is 56 for N=512 and 102 for N=1024 (larger N are currently infrequent). Thus, by these admittedly simple arguments alone, GPU acceleration provides potential speedup factors equivalent to or better than algorithmic optimization.

But now let us take this argument further and consider the arithmetic pipelines themselves (above we simply assumed that both pipelines ran at equal speed). The main operations in backprojection are 3D (voxel) projections, 2D (image) interpolations, and scalar (voxel) value additions. These are in fact the core operations of computer graphics algorithms, and they are thus well optimized on graphics boards, which were developed to generate high-quality real-time 3D imagery for computer games. In contrast, on general-purpose CPUs these operations form just another (completely unoptimized) sequence of instructions. Therefore the acceleration factor is likely even higher than the 128 estimated above, and the results section of this paper offers measured data to support this point.
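To make the per-voxel workload concrete, below is a minimal CUDA-style sketch of a voxel-driven backprojection kernel for a simplified parallel-beam geometry. It is illustrative only: the function names, the geometry model, and the choice of CUDA are our own assumptions for exposition; a shader-based implementation would additionally let the texture units perform the bilinear interpolation in hardware.

#include <cuda_runtime.h>
#include <math.h>

/* Bilinear interpolation into one n x n projection image (row-major).
   Out-of-range samples return 0. */
__device__ float bilerp(const float* img, int n, float u, float v)
{
    int u0 = (int)floorf(u), v0 = (int)floorf(v);
    if (u0 < 0 || v0 < 0 || u0 >= n - 1 || v0 >= n - 1) return 0.0f;
    float fu = u - u0, fv = v - v0;
    const float* q = img + v0 * n + u0;
    return (1.0f - fv) * ((1.0f - fu) * q[0] + fu * q[1]) +
                  fv   * ((1.0f - fu) * q[n] + fu * q[n + 1]);
}

/* One thread per voxel: project the voxel onto every (already filtered)
   projection, interpolate, and accumulate. This is the O(1) work per voxel
   and projection that adds up to O(N^4) over all ~(pi/2)N projections.
   Launch with, e.g., block = (16,16,1) and grid = ((n+15)/16, (n+15)/16, n). */
__global__ void backproject(float* vol, const float* projs,
                            const float* angles, int numProj, int n)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z;                        /* one z-slice per grid layer */
    if (x >= n || y >= n) return;

    float cx = x - 0.5f * n, cy = y - 0.5f * n;  /* volume-centered coords */
    float sum = 0.0f;
    for (int p = 0; p < numProj; ++p) {
        float a = angles[p];
        /* Parallel beam: detector u is the in-plane rotated coordinate,
           v follows z directly. */
        float u = cx * cosf(a) + cy * sinf(a) + 0.5f * n;
        sum += bilerp(projs + (size_t)p * n * n, n, u, (float)z);
    }
    vol[((size_t)z * n + y) * n + x] = sum;
}

Each iteration of the inner loop performs exactly the three operations named above (a voxel projection, a 2D interpolation, and a scalar accumulation), and the N³ per-voxel threads are scheduled across the 128 stream processors.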

We have shown in our previous work that the simple streaming data model of backprojection accelerates very well on GPUs. While a divide-and-conquer algorithm can also be implemented, it requires more elaborate data management, which slows down the data flow and processing. Given the complexity of the hardware, contrasting these two configurations is difficult using theoretical arguments of the type employed above. We therefore base our analysis on measured results.