Graphic-Card Cluster for Astrophysics (GraCCA) - Performance Tests

 

Hsi-Yu Schive, Chia-Hung Chien, Shing-Kwong Wong, Yu-Chih Tsai, Tzihong Chiueh

Department of Physics, National Taiwan University, Taipei, Taiwan

 

 


Abstract

 

In this paper, we describe the architecture and performance of the GraCCA system, a Graphic-Card Cluster for Astrophysics simulations. It consists of 16 nodes, each equipped with 2 modern graphic cards, the NVIDIA GeForce 8800 GTX. This computing cluster provides a theoretical performance of 16.2 TFLOPS. To demonstrate its performance in astrophysics computation, we have implemented a parallel direct N-body simulation program with a shared time-step algorithm on this system. Our system achieves a measured performance of 7.1 TFLOPS and a parallel efficiency of 90% in simulating a globular cluster of 1024K particles. Compared with the GRAPE-6A cluster at RIT (Rochester Institute of Technology), the GraCCA system achieves more than twice the measured speed and an even higher performance-per-dollar ratio. Moreover, our system can handle up to 320M particles and can serve as a general-purpose computing cluster for a wide range of astrophysics problems.

 

Keywords: gravitation; stellar dynamics; methods: N-body simulations; methods: numerical

 

1. Introduction

 

Gravitational N-body simulation plays a significant role in astrophysics, with applications including planetary systems, galaxies, galactic nuclei, globular clusters, galaxy clusters, and the large-scale structure of the universe. The number of particles involved (denoted as N) ranges from O(10) in planetary systems to O(10^10) in cosmological simulations. Since gravity is a long-range force, the main challenge of such simulations lies in the calculation of all N^2 pairwise interactions; consequently, any simulation involving more than about 10^6 particles must chiefly employ a mean-field scheme (see below). In the case of a collisional system, the evolution timescale is roughly determined by the two-body relaxation time, which is proportional to N/log(N) (Spitzer, 1987). This implies that the total simulation time approximately scales as O(N^3) (Giersz & Heggie, 1994; Makino, 1996). Therefore, the size of such astrophysical simulations is usually limited. For example, for a CPU with a sustained performance of 10 GFLOPS (giga floating-point operations per second), it would take more than 5 years to simulate the core collapse in a globular cluster with N = 64K.
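The O(N^3) scaling can be assembled from the two statements above; this is a standard back-of-the-envelope estimate rather than a result specific to this paper:

    cost per time-step ∝ N^2                          (direct summation over all pairs)
    number of time-steps ∝ t_relax ∝ N/log(N)         (the evolution lasts a few relaxation times)
    total cost ∝ N^2 × N/log(N) = N^3/log(N) ≈ O(N^3) (up to the slowly varying logarithm)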

A common way to speed up the N^2 force calculation is to adopt the individual time-step scheme (Aarseth, 1963) along with the block time-step algorithm (McMillan, 1986; Makino, 1991). The former assigns a different and adaptive time-step to each particle. Since the characteristic timescale in some astrophysical simulations varies greatly between dense and sparse regions, it is more efficient to assign an individual time-step to each particle. The latter normally quantizes the time-steps to powers of two and advances particles group by group. Such an algorithm is especially suitable for vector machines and cluster computers, since a group of particles may be advanced in parallel. Moreover, it also reduces the time spent predicting particle attributes.
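To make the quantization concrete, the following host-side sketch (our illustration, not code from the paper) reduces a particle's desired adaptive time-step to the nearest block level dt = dt_max / 2^k:

#include <math.h>

/* Illustrative sketch only: quantize a desired time-step to the block
   hierarchy dt_max / 2^k, so that particles sharing the same level k can
   be advanced together as one group.                                     */
double block_time_step(double dt_desired, double dt_max)
{
    int k = (int) ceil( log2(dt_max / dt_desired) ); /* smallest k with dt_max/2^k <= dt_desired */
    if (k < 0) k = 0;                                /* never exceed dt_max */
    return dt_max / pow(2.0, k);                     /* dt = dt_max / 2^k   */
}

Because the resulting levels are exact powers of two, the block boundaries of different groups always align, which is what allows a whole group of particles to be advanced in a single parallel step.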

An alternative approach to improving performance is to replace the direct-summation scheme with an approximate but efficient scheme that has better scaling than O(N^2). Examples of such schemes include the Barnes-Hut tree code (Barnes & Hut, 1986), the Particle-Mesh (PM) code (Klypin & Holtzman, 1997), the Particle-Particle/Particle-Mesh (P3M) code (Efstathiou & Eastwood, 1981), and the Tree-Particle-Mesh (TPM) code (Xu, 1995). These schemes are efficient and can deal with a large number of particles; accordingly, they are often used in large-scale structure simulations. Their drawbacks are limited accuracy and the inability to handle close encounters, which make them inappropriate for studying some physics, such as core collapse in globular clusters.

To achieve both accuracy and efficiency, one needs a high-performance computer running the direct-summation algorithm. GRAPE (GRAvity piPE) (Sugimoto et al., 1990; Makino et al., 2003; Fukushige et al., 2005) was developed for this purpose. It is special-purpose hardware dedicated to the calculation of gravitational interactions. By implementing multiple force-calculation pipelines that compute multiple pairwise interactions in parallel, it achieves ultra-high performance. The latest version, GRAPE-6, comprises 12288 pipelines and offers a theoretical performance of 63.04 TFLOPS. There is also a less powerful version, GRAPE-6A, released in 2005, which is designed for constructing PC-GRAPE cluster systems in which each GRAPE-6A card is attached to one host computer. A single GRAPE-6A card has 24 force-calculation pipelines and offers a theoretical performance of 131.3 GFLOPS. Several research institutes have constructed such PC-GRAPE clusters (Fukushige et al., 2005; Johnson & Ates, 2005; Harfst et al., 2007; MODEST), where the peak performance is reported to be about 4 TFLOPS. However, the main disadvantages of such systems are the relatively high cost, the low communication bandwidth, and the lack of flexibility due to their special-purpose design (Portegies Zwart et al., 2007).

By contrast, the graphics processing unit (GPU) now provides an alternative for high-performance computation (Dokken et al., 2005). The original purpose of the GPU was to serve as a graphics accelerator for speeding up image processing and 3D rendering (e.g., matrix manipulation, lighting, fog effects, and texturing). Since these kinds of operations usually involve large amounts of data to be processed independently, the GPU is designed to work in a Single-Instruction, Multiple-Data (SIMD) fashion, processing multiple vertices and fragments in parallel. Inspired by its programmability, high performance, large memory size, and relatively low cost, the use of the GPU for general-purpose computation (GPGPU) has been an active area of research since 2004 (Fan et al., 2004; Owens et al., 2005, 2007). The theoretical performance of the GPU has grown from 50 GFLOPS for the NV40 GPU in 2004 to more than 500 GFLOPS for the G80 GPU (adopted in the GeForce 8800 GTX graphic card) in late 2006. This high computing power arises mainly from its fully pipelined architecture plus its high memory bandwidth.

The traditional GPGPU scheme works as follows (Pharr & Fernando, 2005; Dokken et al., 2005). First, physical attributes are stored in a randomly-accessible memory in the GPU, called a texture. Next, one uses a high-level shading language, such as GLSL, Cg (Fernando & Kilgard, 2003), Brook (Buck et al., 2004), or HLSL, to program the GPU for the desired application. After that, one uses a graphics application programming interface (API) such as OpenGL or DirectX to initialize the computation, to define the simulation size, and to transfer data between PC and GPU memory. Note that the original design of the graphic card is to render calculation results to the screen, which supports only 8-bit precision for each variable. So finally, in order to preserve 32-bit accuracy, one needs to use a mechanism called the "frame buffer object" (FBO) to redirect the calculation results to another texture for further iterations. This mechanism also makes iterations within the GPU more efficient: in many GPGPU applications, the entire computation may reside within GPU memory (apart from initializing and storing data on the hard disk), which minimizes the communication between the GPU and the host computer.
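The following host-side C sketch (our illustration; the paper provides no code) shows the essence of this texture ping-pong: two 32-bit float textures are attached to an FBO, and each pass renders from one into the other instead of to the screen. The function and constant names come from the standard OpenGL EXT_framebuffer_object and ARB_texture_float extensions.

#include <GL/glew.h>

GLuint tex[2], fbo;

/* Create two 32-bit floating-point textures and one frame buffer object.
   Float storage avoids the 8-bit per-channel precision of the screen.    */
void init_ping_pong(int width, int height)
{
    glGenTextures(2, tex);
    for (int i = 0; i < 2; i++) {
        glBindTexture(GL_TEXTURE_2D, tex[i]);
        glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F_ARB, width, height,
                     0, GL_RGBA, GL_FLOAT, NULL);
    }
    glGenFramebuffersEXT(1, &fbo);
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);
}

/* One iteration: read attributes from tex[src] and write results to
   tex[dst] (instead of the screen) by drawing a full-screen quad.        */
void iterate(int src, int dst)
{
    glFramebufferTexture2DEXT(GL_FRAMEBUFFER_EXT, GL_COLOR_ATTACHMENT0_EXT,
                              GL_TEXTURE_2D, tex[dst], 0);
    glBindTexture(GL_TEXTURE_2D, tex[src]);
    /* ... draw a full-screen quad here so the fragment shader runs once
       per texel; then swap src and dst for the next iteration ...        */
}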

In February 2007, the NVIDIA Corporation released a new GPU computing architecture, the Compute Unified Device Architecture (CUDA) (NVIDIA, 2007), which makes general-purpose computation on the GPU even more efficient and user-friendly. Compared with the traditional graphics API, CUDA views the GPU as a multithreaded coprocessor with a standard C language interface. All threads executing the same kernel on the GPU are divided into several thread blocks, and each block contains the same number of threads. Threads within the same block may share data through an on-chip parallel data cache, which is small but has much lower memory latency than the off-chip DRAM. So, by storing common and frequently used data in this fast shared memory, it is possible to remove the memory-bandwidth bottleneck for computation-intensive applications.
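The following minimal CUDA kernel (our illustration; the paper's actual N-body kernel is described in Section 3) shows the pattern this enables: each thread block cooperatively stages a tile of particle data in shared memory, and every thread then reuses that tile instead of re-reading device memory. It assumes the block size equals TILE and that n is a multiple of TILE.

#define TILE 128  /* threads per block; assumed to divide n evenly */

/* Thread i accumulates the potential on particle i; pos[j] holds
   (x, y, z, mass). Each tile is fetched from device memory once per block
   but read TILE times by every thread in the block.                      */
__global__ void potential(const float4 *pos, float *phi, int n)
{
    __shared__ float4 tile[TILE];
    int    i   = blockIdx.x * blockDim.x + threadIdx.x;
    float4 pi  = pos[i];
    float  sum = 0.0f;

    for (int base = 0; base < n; base += TILE) {
        tile[threadIdx.x] = pos[base + threadIdx.x]; /* cooperative load */
        __syncthreads();                             /* tile complete    */
        for (int j = 0; j < TILE; j++) {
            float dx = tile[j].x - pi.x;
            float dy = tile[j].y - pi.y;
            float dz = tile[j].z - pi.z;
            float r2 = dx*dx + dy*dy + dz*dz + 1.0e-9f; /* softening also
                                                  regularizes the i = j pair */
            sum += tile[j].w * rsqrtf(r2);              /* m_j / r           */
        }
        __syncthreads();                             /* before next load */
    }
    phi[i] = -sum;
}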

In the hardware implementation, all stream processors in the GPU are grouped into several multiprocessors. Each multiprocessor has its own shared-memory space and works in a SIMD fashion. Each thread block mentioned above is executed by exactly one multiprocessor, so its threads may share data through the shared memory. Take the NVIDIA GeForce 8800 GTX graphic card (NVIDIA, 2006) for example. It consists of 16 multiprocessors, each composed of 8 stream processors and equipped with 16 KB of shared memory. By allowing the dual issue of MAD (multiply-and-add) and MUL (multiplication) instructions, this graphic card delivers a theoretical computing power of 518.4 GFLOPS. In addition, it has 768 MB of GDDR3 memory (referred to as device memory or GPU memory) with a memory bandwidth of 86.4 GB/s, and it supports IEEE-754 single-precision floating-point operations. By contrast, the currently most advanced memory bus in a workstation, dual-channel DDR2 800, has a memory bandwidth of 12.8 GB/s.
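These headline numbers follow from simple arithmetic. The 1.35 GHz shader clock used below is not stated above but is the standard specification of this card:

    16 multiprocessors × 8 stream processors = 128 stream processors
    128 × 1.35 GHz × 3 FLOP/cycle (dual-issued MAD + MUL) = 518.4 GFLOPS
    (384-bit bus / 8) × 1.8 GHz effective GDDR3 data rate = 86.4 GB/s

The MAD counts as two floating-point operations per cycle and the co-issued MUL as a third, which is where the factor of 3 comes from.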

Scientific computations such as the finite-element method and particle-particle interactions are especially suitable GPGPU applications, since they can easily take advantage of the GPU's parallel-computation architecture. In previous works, Nyland et al. (2004) and Harris (2005) implemented N-body simulations on the GPU, but with limited performance improvement. More recently, a 50-fold speedup over a Xeon CPU was achieved by using a GeForce 8800 GTX graphic card and the Cg shading language (Portegies Zwart et al., 2007), but this is still about an order of magnitude slower than a single GRAPE-6A card. Elsen et al. (2007) achieved nearly 100 GFLOPS sustained performance by using an ATI X1900XTX graphic card and the Brook shading language. Hamada and Iitaka (2007) proposed the "Chamomile" scheme using CUDA and achieved a performance of 256 GFLOPS for the acceleration calculation only. Belleman et al. (2007) proposed the "Kirin" scheme, also using CUDA, and achieved a performance of 236 GFLOPS for acceleration, jerk, and potential calculations. Although the works of Hamada & Iitaka and Belleman et al. have outperformed a single GRAPE-6A card, they are either a sequential code that applies to a single GPU (Hamada & Iitaka, 2007) or a parallel code that has only been tested on a 2-GPU system (Belleman et al., 2007). Consequently, their performances are still not comparable to those of GRAPE-6 and GRAPE-6A clusters.

Building on these works, we have constructed a 32-GPU cluster named GraCCA, which is compatible with CUDA and has achieved a measured performance of about 7 TFLOPS. In this paper, we describe the architecture and performance of our GPU cluster. We first describe the hardware architecture in detail in Section 2, and then our implementation of parallel direct N-body simulation in Section 3. We discuss the performance measurements in Section 4. In Section 5, we give a theoretical performance model, and finally, in Section 6, we compare GraCCA with GRAPE, discuss the stability of the system, and give some future outlook.