Comparison of Parallel Programming Models on Clusters of SMP Nodes
Rolf Rabenseifner and Gerhard Wellein
High-Performance Computing-Center (HLRS), University of Stuttgart,
Allmandring 30, D-70550 Stuttgart, Germany,
rabenseifner@hlrs.de, www.hlrs.de/people/rabenseifner/
Regionales Rechenzentrum Erlangen, Martensstrae 1, D-91058 Erlangen,
Germany,
gerhard.wellein@rrze.uni-erlangen.de
Abstract
Most HPC systems are clusters of shared memory nodes. Parallel programming must combine the distributed memory parallelization on the node interconnect with the shared memory parallelization inside of each node. Various hybrid MPI+OpenMP programming models are compared with pure MPI. Benchmark results of several platforms are presented. This paper analyzes the strength and weakness of several parallel programming models on clusters of SMP nodes. There are several mismatch problems between the (hybrid) programming schemes and the hybrid hardware architectures. Benchmark results on a Myrinet cluster and on recent Cray, NEC, IBM, Hitachi, SUN and SGI platforms show, that the hybridmasteronly programming model can be used more efficiently on some vector-type systems, but also on clusters of dual-CPUs. On other systems, one CPU is not able to saturate the inter-node network and the commonly used masteronly programming model soffers from insufficient inter-node bandwidth. This paper analyses strategies to overcome typical drawbacks of this easily usable programming scheme on systems with weaker inter-connects. Best performance can be achieved with overlapping communication and computation, but this scheme is lacking in ease of use.
Keywords. OpenMP, MPI, Hybrid Parallel Programming, Threads and MPI, HPC, Performance.
1 Introduction
Most systems in High Performance Computing are clusters of shared memory nodes. Such hybrid systems range from small clusters of dual-CPU PCs up to largest systems like the Earth Simulator consisting of 640 SMP nodes connected by a single-stage cross-bar and with SMP nodes combining 8 vector CPUs on a shared memory [3, 5]. Optimal parallel programming schemes enable the application programmer to use the hybrid hardware in a most efficient way, i.e., without any overhead induced by the programming scheme. On distributed memory systems, message passing, especially with MPI [4, 12, 13], has shown to be the mainly used programming paradigm. One reason of the success of MPI was the clear separation of the optimization: communication could be improved by the MPI library, while the numerics had to be optimized by the compiler. On shared memory systems, directive-based parallelization was standardized with OpenMP [15], but there is also a long history of proprietary compiler-directives for parallelization. The directives handle mainly the work sharing; there is no data distribution.
On hybrid systems, i.e., on clusters of SMP nodes, parallel programming can be done in several ways: one can use pure MPI, or some schemes combining MPI and OpenMP, e.g., calling MPI routines only outside of parallel regions (which is herein named the masteronly style), or using OpenMP on top of a (virtual) distributed shared memory (DSM) system. A classication on MPI and OpenMP based parallel programming schemes on hybrid architectures is given in Sect. 2. Unfortunately, there are several mismatch problems between the (hybrid) programming schemes and the hybrid hardware architectures. Often, one can see in publications, that applications may or may not benet from hybrid programming depending on some application parameters, e.g., in [7, 10, 22].
Sect. 3 gives a list of major problems often causing a degradation of the speed-up, i.e., causing that the parallel hardware is utilized only partially. Sect. 4 shows, that there isn't a silver bullet to achieve an optimal speed-up. Measurements show that dierent hardware platforms are more or less prepared for the hybrid programming models. Sect. 5 discusses optimization strategies to overcome typical drawbacks of the hybrid masteronly style. With these modications, efficiency can be achieved together with the ease of parallel programming on clusters of SMPs. Conclusions are provided in Sec.6.
2 Parallel programming on hybrid systems, a classification
Often, hybrid MPI+OpenMP programming denotes a programming style with OpenMP shared memory parallelization inside the MPI processes (i.e., each MPI process itself has several OpenMP threads) and communicating with MPI between the MPI processes, but only outside of parallel regions. For example, if the MPI parallelization is based on a domain decomposition, the MPI communication mainly exchanges the halo information after each iteration of the outer numerical loop. The numerical iterations itself are parallelized with OpenMP, i.e., (inner) loops inside of the MPI processes are parallelized with OpenMP work-sharing directives. However, this scheme is only one style in a set of dierent hybrid programming methods. This hybrid programming scheme will be named masteronly in the following classification, which is based on the question, when and by which thread(s) the messages are sent between the MPI processes:
1. Pure MPI: each CPU of the cluster of SMP nodes is used for one MPI process. The hybrid system is treated as a at massively parallel processing (MPP) system. The MPI library has to optimize the communication by using shared memory based methods between MPI processes on the same SMP node, and the cluster interconnect for MPI processes on different nodes.
2. Hybrid MPI+OpenMP without overlapping calls to MPI routines with other numerical application code in other threads:
2a. Hybrid masteronly: MPI is called only outside parallel regions, i.e., by the master thread.
2b. Hybrid multiple/masteronly: MPI is called outside the parallel regions of the application code, but the MPI communication is done itself by several CPUs: The thread parallelization of the MPI communication can be done
- automatically by the MPI library routines, or
- explicitly by the application, using a full thread-safe MPI library.
In this category, the non-communicating threads are sleeping (or executing some other applications, if non-dedicated nodes are used). This problem of idling CPUs is solved in the next category.
3. Overlapping communication and computation: While the communication is done by the master thread (or a few threads), all other noncommunicating threads are executing application code. This category requires, that the application code is separated into two parts: the code that can be overlapped with the communication of the halo data, and the code that must be deferred until the halo data is received. Inside of this category, we can distinguish two types of sub-categories:
- How many threads communicate: (A) Hybrid funneled: Only the master thread calls MPI routines, i.e., all communication is funneled to the master thread. (B) Hybrid multiple: Each thread handles its own communication needs (B1), or the communication is funneled to more than one thread (B2).
- Except in case B1, the communication load of the threads is inherently unbalanced. To balance the load between threads that communicate and threads that do not communicate, the following load balancing strategies can be used: (I) Fixed reservation: reserving a xed amount of threads for communication and using a xed load balance for the application between the communicating and non-communicating threads; or (II) Adaptive.
4. Pure OpenMP: based on virtual distributed shared memory systems (DSM), the total application is parallelized only with shared memory directives.
Each of these categories of hybrid programming has different reasons, why it is not appropriate for some classes of applications or classes of hybrid hardware architectures. The paper focuses on pure MPI and hybrid masteronly programming style. Overlapping communication and computation is studied in more detail in [16, 17]. Regarding pure OpenMP approaches, the reader is referred to [1, 6, 8, 11, 18, 19, 20]. Different SMP parallelization strategies in the hybrid model are studied in [21] and in [2] for the NAS parallel benchmarks. The following section shows major problems of mismatches between programming and hardware architecture.
3 Mismatch problems
All these programming styles on clusters of SMP nodes have advantages, but also serious disadvantages based on mismatch problems between the (hybrid) programming scheme and the hybrid architecture:
- With pure MPI, minimizing of the inter-node communication requires that the application-domain's neighborhood-topology matches with the hardware topology.
- Pure MPI also introduces intra-node communication on the SMP nodes that can be omitted with hybrid programming.
- On the other hand, such MPI+OpenMP programming is not able to achieve full inter-node bandwidth on all platforms for any subset of intercommunicating threads.
- With masteronly style, all non-communicating threads are idling.
- CPU time is also wasted, if all CPUs of an SMP node communicate, although a few CPUs are already able to saturate the inter-node bandwidth.
- With hybrid masteronly programming, additional overhead is induced by all OpenMP synchronization, but also by additional cache ushing between the generation of data in parallel regions and the consumption in subsequent message passing routines and calculations in subsequent parallel sections.
Overlapping of communication and computation is a chance for an optimal usage of the hardware, but
- causes serious programming eort in the application itself to separate numerical code that needs halo data and that cannot be overlapped with the communication therefore,
- causes overhead due to the additional parallelization level (OpenMP), and
- communicating and non-communicating threads must be load balanced.
A few of these problems will be discussed in more detail and based on benchmark results in the following sections.
.
.
.
References
- Rudolf Berrendorf, Michael Gerndt, Wolfgang E. Nagel and Joachim Prumerr, SVM Fortran, Technical Report IB-9322, KFA Jlich, Germany, 1993. www.fz-juelich.de/zam/docs/printable/ib/ib-93/ib-9322.ps
- Frank Cappello and Daniel Etiemble, MPI versus MPI+OpenMP on the IBM SP for the NAS benchmarks, in Proc. Supercomputing'00, Dallas, TX, 2000. http://citeseer.nj.nec.com/cappello00mpi.html
- The Earth Simulator. www.es.jamstec.go.jp
- William Gropp, Ewing Lusk, Nathan Doss, and Anthony Skjellum, A high- performance, portable implementation of the MPI message passing interface standard, in Parallel Computing 22-6, Sep. 1996, pp 789-828. http://citeseer.nj.nec.com/gropp96highperformance.html
- Shinichi Habataa, Mitsuo Yokokawa, and Shigemune Kitawaki, The Earth Sim- ulator System, in NEC Research & Development, Vol. 44, No. 1, Jan. 2003, Special Issue on High Performance Computing.
- Jonathan Harris, Extending OpenMP for NUMA Architectures, in proceedings of the Second European Workshop on OpenMP, EWOMP 2000. www.epcc.ed.ac.uk/ewomp2000/proceedings.html
- D. S. Henty, Performance of hybrid message-passing and shared-memory paral- lelism for discrete element modeling, in Proc. Supercomputing'00, Dallas, TX, 2000. http://citeseer.nj.nec.com/henty00performance.html www.sc2000.org/techpapr/papers/pap.pap154.pdf
- Matthias Hess, Gabriele Jost, Matthias Muller, and Roland Ruhle, Experiences using OpenMP based on Compiler Directed Software DSM on a PC Cluster, in WOMPAT2002: Workshop on OpenMP Applications and Tools, Arctic Region Supercomputing Center, University of Alaska, Fairbanks, Aug. 5-7, 2002. http://www.hlrs.de/ people/mueller/papers/wompat2002/wompat2002.pdf
- Georg Karypis and Vipin Kumar. A parallel algorithm for multilevel graph par- titioning and sparse matrix ordering, Journal of Parallel and Distributed Computing, 48(1): 71-95, 1998. http://www-users.cs.umn.edu/~karypis/metis/, http://citeseer.nj.nec.com/karypis98parallel.html
- R. D. Loft, S. J. Thomas, and J. M. Dennis, Terascale spectral element dynam- ical core for atmospheric general circulation models, in proceedings, SC 2001, Nov. 2001, Denver, USA. www.sc2001.org/papers/pap.pap189.pdf
- John Merlin, Distributed OpenMP: Extensions to OpenMP for SMP Clusters, in proceedings of the Second European Workshop on OpenMP, EWOMP 2000. www.epcc.ed.ac.uk/ewomp2000/proceedings.html
- Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Rel. 1.1, June 1995, www.mpi-forum.org.
- Message Passing Interface Forum. MPI-2: Extensions to the Message-Passing Interface, July 1997, www.mpi-forum.org.
- Hans Meuer, Erich Strohmaier, Jack Dongarra, Horst D. Simon, Universities of Mannheim and Tennessee, TOP500 Supercomputer Sites, www.top500.org.
- OpenMP Group, www.openmp.org.
- Rolf Rabenseifner and Gerhard Wellein, Communication and Optimization As- pects of Parallel Programming Models on Hybrid Architectures, International Journal of High Performance Computing Applications, Sage Science Press, Vol. 17, No. 1, 2003, pp 49-62.
- Rolf Rabenseifner, Hybrid Parallel Programming: Performance Problems and Chances, in proceedings of the 45th CUG Conference 2003, Columbus, Ohio, USA, May 12-16, 2003, www.cug.org.
- Mitsuhisa Sato, Shigehisa Satoh, Kazuhiro Kusano, and Yoshio Tanaka, De- sign of OpenMP Compiler for an SMP Cluster, in proceedings of the 1st EuropeanWorkshop on OpenMP (EWOMP'99), Lund, Sweden, Sep. 1999, pp 32-39. http://citeseer.nj.nec.com/sato99design.html
- A. Scherer, H. Lu, T. Gross, and W. Zwaenepoel, Transparent Adaptive Paral- lelism on NOWs using OpenMP, in proc. of the Seventh Conference on Principles and Practice of Parallel Programming (PPoPP '99), May 1999, pp 96-106.
- Weisong Shi, Weiwu Hu, and Zhimin Tang, Shared Virtual Memory: A Sur- vey, Technical report No. 980005, Center for High Performance Computing, Institute of Computing Technology, Chinese Academy of Sciences, 1998, www.ict.ac.cn/chpc/dsm/tr980005.ps.
- Lorna Smith and Mark Bull, Development of Mixed Mode MPI / OpenMP Applications, in proceedings of Workshop on OpenMP Applications and Tools (WOMPAT 2000), San Diego, July 2000. www.cs.uh.edu/wompat2000/
- Gerhard Wellein, Georg Hager, Achim Basermann, and Holger Fehske, Fast sparse matrix-vector multiplication for TeraFlop/s computers, in proceedings of VECPAR'2002, 5th Int'l Conference on High Performance Computing and Computational Science, Porto, Portugal, June 26-28, 2002, part I, pp 57-70. http://vecpar.fe.up.pt