Graphic-Card Cluster for Astrophysics (GraCCA) -
Performance Tests
Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh
Department of Physics,
Abstract
In this paper, we describe the
architecture and performance of the GraCCA system, a
Graphic-Card Cluster for Astrophysics simulations. It consists of 16 nodes,
with each node equipped with two modern graphic cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical
performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics
computation, we have implemented a parallel direct N-body simulation program
with a shared time-step algorithm on this system, which achieves a measured
performance of 7.1 TFLOPS and a parallel efficiency of 90% for simulating a
globular cluster of 1024K particles. Compared with the GRAPE-6A cluster at
RIT (Rochester Institute of Technology), the GraCCA
system achieves more than twice the measured speed and an even higher
performance-per-dollar ratio. Moreover, our system can handle up to 320M
particles and can serve as a general-purpose computing cluster for a wide
range of astrophysics problems.
Keywords: gravitation; stellar
dynamics; methods: N-body simulations; methods: numerical
1. Introduction
The gravitational N-body
simulation plays a significant role in astrophysics, including planetary
systems, galaxies, galactic nuclei, globular clusters, galaxy clusters, and
large-scale structures of the universe. The number of particles involved
(denoted as N) ranges from O(10) in planetary
systems to O(10¹⁰) in cosmological simulations. Since gravity is a long-range
force, the main challenge of such simulations lies in the calculation of
all N² pairwise interactions. Therefore, any simulation involving more than
about 10⁶ particles has to rely chiefly on a mean-field
scheme (see below). In the case of a collisional
system, the evolution timescale is roughly determined by the two-body relaxation
time, which is proportional to N/log(N) (Spitzer,
1987). This implies that the total simulation time scales approximately as O(N³) (Giersz & Heggie, 1994; Makino, 1996). Therefore, the size of such
astrophysical simulations is usually limited. For example, for a CPU with 10
GFLOPS (giga floating-point operations per second) sustained performance, it would
take more than 5 years to simulate the core collapse in a globular cluster with
N = 64K.
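For concreteness, the direct-summation approach referred to above amounts to the O(N²) accumulation of softened pairwise accelerations sketched below (a minimal host-side illustration in CUDA C; the function name, data layout, and Plummer-softening form are our own assumptions, not the code used in this work):

```cuda
#include <math.h>

/* Minimal O(N^2) direct-summation gravity sketch (illustrative only).
 * pos[i] = (x, y, z) of particle i, mass[i] = its mass, acc[i] = output acceleration.
 * eps2 is the squared softening length; eps2 > 0 also makes the i == j self term vanish. */
void direct_sum(int N, const float pos[][3], const float mass[],
                float acc[][3], float eps2)
{
    for (int i = 0; i < N; i++) {
        float ax = 0.0f, ay = 0.0f, az = 0.0f;
        for (int j = 0; j < N; j++) {
            float dx = pos[j][0] - pos[i][0];
            float dy = pos[j][1] - pos[i][1];
            float dz = pos[j][2] - pos[i][2];
            float r2   = dx*dx + dy*dy + dz*dz + eps2;  /* softened distance squared */
            float rinv = 1.0f / sqrtf(r2);
            float w    = mass[j] * rinv * rinv * rinv;  /* m_j / r^3 (G = 1 units) */
            ax += w * dx;  ay += w * dy;  az += w * dz;
        }
        acc[i][0] = ax;  acc[i][1] = ay;  acc[i][2] = az;
    }
}
```

Every particle requires N evaluations of the inner loop, which is why the cost per time-step grows as N².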
A common way to speed up the
N² force calculation is to adopt the individual time-step scheme (Aarseth, 1963) along with the block time-step algorithm
(McMillan, 1986; Makino, 1991). The former assigns a different and adaptive
time-step to each particle. Since the characteristic time-scale in some
astrophysical simulations varies greatly between a dense region and a sparse
region, it is more efficient to assign an individual time-step to each
particle. The latter normally quantizes the time-steps to powers of two and
advances particles group-by-group. Such an algorithm is especially suitable for
vector machines and cluster computers, since a group of particles may be
advanced in parallel. Moreover, it also reduces the time for predicting the
particle attributes.
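As an illustration of the block time-step idea, the sketch below quantizes an individual time-step down to a power-of-two fraction of a maximum step and then gathers the particles due for an update at the current time into one block (a hedged sketch of the standard rule; the function names and the equality-based selection are illustrative assumptions, not necessarily the criteria used in this work):

```cuda
/* Quantize a desired time-step to the largest power-of-two fraction of
 * dt_max that does not exceed it (standard block time-step rule). */
float block_timestep(float dt_desired, float dt_max)
{
    float dt = dt_max;
    while (dt > dt_desired)
        dt *= 0.5f;                 /* halve until dt <= dt_desired */
    return dt;
}

/* Particles whose next update time coincides with the current system time
 * form one block and can be advanced in parallel. */
int select_block(int N, const double t_next[], double t_now, int index_out[])
{
    int count = 0;
    for (int i = 0; i < N; i++)
        if (t_next[i] == t_now)
            index_out[count++] = i;
    return count;
}
```

Because every quantized step divides dt_max evenly, particles on different rungs still meet at common synchronization times, which is what makes group-by-group advancement possible.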
An alternative approach to
improve performance is to replace the direct-summation scheme by an approximate
and efficient scheme, which has a better scaling than O(N2).
Examples of such schemes include the Barnes-Hut tree code (Barnes & Hut,
1986), Particle-Mesh (PM) code (Klypin & Holtzman, 1997), Particle-Particle/Particle-Mesh (P³M) code
(Efstathiou & Eastwood, 1981), and
Tree-Particle-Mesh (TPM) code (Xu, 1995). These
schemes are efficient and can deal with a large number of particles.
Accordingly, they are often used in large-scale structure simulations. The
drawbacks of such schemes are the limited accuracy and the inability to deal
with close encounters, which make them inappropriate for studying certain physics,
such as core collapse in globular clusters.
To achieve both accuracy and
efficiency, one needs a high-performance computer with direct-summation
algorithm. The development of GRAPE (GRAvity piPE) (Sugimoto et al., 1990; Makino et al., 2003; Fukushige et al., 2005) was made for this purpose. It is a
special-purpose hardware dedicated to the calculation of gravitational
interactions. By implementing multiple force calculation pipelines to calculate
multiple pairwise interactions in parallel, it
achieves an ultra-high performance. The latest version, GRAPE-6, comprises
12288 pipelines and offers a theoretical performance of 63.04 TFLOPS. There is
also a less powerful version, GRAPE-6A, released in 2005. It is designed for
constructing a PC-GRAPE cluster system, in which each GRAPE-6A card is attached
to one host computer. A single GRAPE-6A card has 24 force calculation pipelines
and offers a theoretical performance of 131.3 GFLOPS. Some research institutes
have constructed such PC-GRAPE clusters (Fukushige et
al., 2005; Johnson & Ates, 2005; Harfst et al., 2007; MODEST), where the peak performance
is reported to be about 4 TFLOPS. However, the main disadvantages of such a
system are the relatively high cost, the low communication bandwidth, and the
lack of flexibility due to its special-purpose design (Portegies
Zwart et al., 2007).
By contrast, the graphic
processing unit (GPU) now provides an alternative for high-performance
computation (Dokken et al., 2005). The original
purpose of GPU is to serve as a graphics accelerator for speeding up image
processing and 3D rendering (e.g., matrix manipulation, lighting, fog effects,
and texturing). Since these kinds of operations usually involve a large amount
of data to be processed independently, the GPU is designed to work in a Single
Instruction, Multiple Data (SIMD) fashion that processes multiple vertices and
fragments in parallel. Inspired by its advantages of programmability, high
performance, large memory size, and relatively low cost, the use of GPU for
general-purpose computation (GPGPU) has become an active area of research ever
since 2004 (Fan et al., 2004; Owens et al., 2005, 2007). The theoretical
performance of GPU has grown from 50 GFLOPS for the NV40 GPU in 2004 to more than
500 GFLOPS for the G80 GPU (which is adopted in the GeForce
8800 GTX graphic card) in late 2006. This high computing power mainly arises
from its fully pipelined architecture plus the high memory bandwidth.
The traditional scheme in
GPGPU works as follows (Pharr & Fernando, 2005; Dokken
et al., 2005). First, physical attributes are stored in a randomly-accessible
memory in the GPU, called a texture. Next, one uses a high-level shading language,
such as GLSL, Cg (Fernando & Kilgard, 2003),
Brook (Buck et al., 2004), or HLSL, to program GPU for desired applications.
After that, one uses graphics application programming interface (API) such as
OpenGL or DirectX to initialize computation, to define simulation size, and to
transfer data between the PC and GPU memory. Note that a graphic card is
originally designed to render calculation results to the screen, which only
supports 8-bit precision for each variable. So finally, in order to preserve
32-bit accuracy, one needs to use a method called “frame buffer object” (FBO)
to redirect the calculation results to another texture memory for further
iterations. In addition, this method also makes iterations in the GPU more
efficient. For example, in many GPGPU applications the entire computation may
reside within the GPU memory (except for initializing and storing data
in hard disk), which minimizes the communication between GPU and the host
computer.
In February 2007, the NVIDIA
Corporation released a new computing architecture in GPU, the Compute Unified
Device Architecture (CUDA) (NVIDIA, 2007), which makes general-purpose
computation in the GPU even more efficient and user-friendly. Compared with the
traditional graphics API, CUDA views the GPU as a multithreaded coprocessor with a
standard C language interface. All threads that execute the same kernel in GPU
are divided into several thread blocks, and each block contains the same number
of threads. Threads within the same block may share their data through an
on-chip parallel data cache, which is small but has much lower memory latency
than the off-chip DRAM. So, by storing common and frequently used data in this
fast shared memory, it is possible to remove the memory bandwidth bottleneck for
computation-intensive applications.
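As a concrete illustration of this programming model, the sketch below shows a tiled gravity kernel in CUDA: each thread block stages one tile of particle data in the on-chip shared memory, synchronizes, and lets every thread accumulate the partial accelerations from that tile before the next tile is loaded, so each particle is read from the device memory once per tile instead of once per pairwise interaction. This is only a simplified version of the widely used tiled scheme, not the actual GraCCA kernel; the kernel name, TILE size, and float4 data layout are our own assumptions.

```cuda
#define TILE 128   /* threads per block = particles staged per tile (illustrative choice) */

/* pos[i] = (x, y, z, mass) of particle i; acc[i] receives its softened acceleration.
 * Launch with blockDim.x == TILE; eps2 > 0 is assumed (it also suppresses the i == j term). */
__global__ void force_kernel(int N, const float4 *pos, float4 *acc, float eps2)
{
    __shared__ float4 tile[TILE];                       /* on-chip staging buffer */
    int    i  = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi = (i < N) ? pos[i] : make_float4(0.f, 0.f, 0.f, 0.f);
    float3 ai = make_float3(0.f, 0.f, 0.f);

    for (int base = 0; base < N; base += TILE) {
        int j = base + threadIdx.x;
        tile[threadIdx.x] = (j < N) ? pos[j] : make_float4(0.f, 0.f, 0.f, 0.f);
        __syncthreads();                                /* tile fully loaded */

        for (int k = 0; k < TILE; k++) {                /* interact with the staged tile */
            float dx = tile[k].x - pi.x;
            float dy = tile[k].y - pi.y;
            float dz = tile[k].z - pi.z;
            float r2   = dx*dx + dy*dy + dz*dz + eps2;
            float rinv = rsqrtf(r2);
            float w    = tile[k].w * rinv * rinv * rinv;  /* m_j / r^3, zero for padding */
            ai.x += w * dx;  ai.y += w * dy;  ai.z += w * dz;
        }
        __syncthreads();                                /* buffer may be reused next pass */
    }
    if (i < N) acc[i] = make_float4(ai.x, ai.y, ai.z, 0.f);
}
```

A host would launch it as, e.g., force_kernel<<<(N + TILE - 1) / TILE, TILE>>>(N, d_pos, d_acc, eps2), after copying the particle data to the device memory once per step.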
In the hardware implementation,
all stream processors in the GPU are grouped into several multiprocessors. Each
multiprocessor has its own shared memory space and works in a SIMD fashion.
Each thread block mentioned above is executed by only one multiprocessor, so
these threads may share their data through the shared memory. Take the NVIDIA GeForce 8800 GTX graphic card (NVIDIA, 2006) for example.
It consists of 16 multiprocessors. Each multiprocessor is composed of 8 stream
processors and has 16 KB shared memory. By allowing the dual-issue of MAD
(multiplication and addition) and MUL (multiplication) instructions, this
graphic card gives a theoretical computing power of 518.4 GFLOPS. Besides, it
has 768 MB of GDDR3 memory (referred to as the device memory or GPU memory) with a memory
bandwidth of 86.4 GB/s and supports IEEE-754 single-precision floating-point
operations. By contrast, the currently most advanced memory bus, dual-channel
DDR2 800, in a workstation has a memory bandwidth of 12.8 GB/s.
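For reference, the 518.4 GFLOPS figure quoted above follows from counting three floating-point operations per stream processor per clock (two for the MAD and one for the dual-issued MUL) over all 16 × 8 = 128 stream processors at the card's nominal 1.35 GHz shader clock (a published specification, not stated explicitly above): 128 × 3 × 1.35 GHz = 518.4 GFLOPS.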
Scientific computations such
as finite-element method and particle-particle interactions are especially
suitable for GPGPU applications, since they can easily take advantage of the
parallel-computation architecture of GPU. In previous works, Nyland et al. (2004) and Harris (2005) implemented the
N-body simulation on the GPU but with limited performance improvement. More
recently, a 50-fold speedup over a Xeon CPU was achieved by using the GeForce 8800 GTX graphic card and the Cg shading language (Portegies Zwart et al., 2007),
but it was still about an order of magnitude slower than a single GRAPE-6A card.
Elsen et al. (2007) achieved nearly 100 GFLOPS
sustained performance by using ATI X1900XTX graphic card and Brook shading
language. Hamada and Iitaka (2007) proposed the
“Chamomile” scheme by using CUDA, and achieved a performance of 256 GFLOPS for
acceleration calculation only. Belleman et al. (2007)
proposed the “Kirin” scheme, which is likewise based on CUDA.
Based on these works, we have
built a 32-GPU cluster named GraCCA, which is
compatible with CUDA and has achieved a measured performance of about 7 TFLOPS.
In this paper, we describe the architecture and performance of our GPU cluster.
We first describe the hardware architecture in detail in Section 2, and then
our implementation of parallel direct N-body simulation in Section 3. We
discuss the performance measurements in Section 4. In Section 5, we give a
theoretical performance model. Finally, in Section 6 we compare our system with
GRAPE, discuss the stability of GraCCA, and give some future outlook.