University of Nebraska - Lincoln, Department of Computer Science and Engineering
CSE Journal Articles, 2010

Communication-Aware Load Balancing for Parallel Applications on Clusters

Xiao Qin (Auburn University, xqin@auburn.edu), Hong Jiang (University of Nebraska - Lincoln, jiang@cse.unl.edu), Adam Manzanares (Auburn University, acm0008@auburn.edu), Xiaojun Ruan (Auburn University, xzr001@auburn.edu), and Shu Yin (Auburn University, szy0001@auburn.edu)

This paper is posted at DigitalCommons@University of Nebraska - Lincoln. http://digitalcommons.unl.edu/csearticles/45

Abstract—Cluster computing has emerged as a primary and cost-effective platform for running parallel applications, including communication-intensive applications that transfer a large amount of data among the nodes of a cluster via the interconnection network. Conventional load balancers have proven effective in increasing the utilization of CPU, memory, and disk I/O resources in a cluster. However, most of the existing load-balancing schemes ignore network resources, leaving an opportunity to improve the effective bandwidth of networks on clusters running parallel applications. For this reason, we propose a communication-aware load-balancing technique that is capable of improving the performance of communication-intensive applications by increasing the effective utilization of networks in cluster environments. To facilitate the proposed load-balancing scheme, we introduce a behavior model for parallel applications with large requirements of network, CPU, memory, and disk I/O resources. Our load-balancing scheme can make full use of this model to quickly and accurately determine the load induced by a variety of parallel applications. Simulation results generated from a diverse set of both synthetic bulk synchronous and real parallel applications on a cluster show that our scheme significantly improves the performance, in terms of slowdown and turn-around time, over existing schemes by up to 206 percent (with an average of 74 percent) and 235 percent (with an average of 82 percent), respectively.

1 Introduction

Significant cost advantages, combined with rapid advances in middleware and interconnect technologies, have made clusters a primary and fast-growing platform for high-performance scientific computing. Scheduling [6] and load balancing [18], [19] are two key techniques used to improve the performance of clusters for scientific applications by fully utilizing machines with idle or underutilized resources. A number of distributed load-balancing schemes for clusters have been developed, each considering one or more resources: the CPU [13], memory [1], disk I/O [18], or a combination of CPU and memory [31]. These approaches have proven effective in increasing resource utilization in clusters, under the assumption that network interconnects are not potential bottlenecks. However, recent studies demonstrated that the need to move data from one component to another in a cluster is likely to become a major performance bottleneck [10], [23]; that is, data movement through a cluster's interconnect can limit overall performance. Thus, if the opportunity to improve the effective bandwidth of networks is fully exploited, the performance of parallel jobs on clusters can be enhanced.

A large number of scientific applications have been implemented for execution on clusters, and more are underway. Many scientific applications are inherently computation- and communication-intensive [8]. Examples of such applications include 3D perspective rendering, molecular dynamics simulation, quantum chemical reaction dynamics simulations, and 2D fluid flow using the vortex method, to name just a few [8]. The above bottleneck becomes more severe if the resources of a cluster are time/space-shared among multiple scientific applications and the communication load is not evenly distributed among the cluster nodes. Furthermore, the performance gap between the effective speeds of CPUs and networks continues to widen, which raises the need to increase the utilization of networks on clusters through various techniques.

The main motivation for our study is to improve the efficiency and usability of networks in cluster computing environments with high communication demands. Efforts to improve the utilization of networks can be classified into hardware and software approaches. In this paper, we focus on an approach designed at the software level. In particular, we develop a communication-aware load-balancing scheme (referred to as COM-aware) to achieve high effective bandwidth communication without requiring any additional hardware. Our approach can substantially improve the performance of parallel applications running on a large-scale, time/space-shared cluster, where a number of parallel jobs share various resources. COM-aware load balancing enables a cluster to utilize most idle or underutilized network resources while keeping the usage of other types of resources reasonably high. We propose a behavioral model for parallel applications to capture the typical characteristics of their requirements for CPU, memory, network, and disk I/O resources.
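To make the behavioral model concrete, the following sketch (our own illustration in Python; the class and field names such as Phase, TaskProfile, cpu_time, and network_mb are hypothetical and do not come from the paper) represents a parallel task as a sequence of phases, each annotated with its CPU, memory, network, and disk I/O demands, from which an aggregate demand can be derived:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Phase:
        # One execution phase of a parallel task (all fields are illustrative).
        cpu_time: float     # CPU service demand in seconds
        memory_mb: float    # resident memory requirement in MB
        network_mb: float   # data exchanged over the interconnect in MB
        disk_io_mb: float   # data read from or written to local disk in MB

    @dataclass
    class TaskProfile:
        # Behavioral model of a task: an ordered list of phases.
        phases: List[Phase] = field(default_factory=list)

        def demand(self) -> Phase:
            # Aggregate demand: sum time and traffic, take the peak memory need.
            return Phase(
                cpu_time=sum(p.cpu_time for p in self.phases),
                memory_mb=max((p.memory_mb for p in self.phases), default=0.0),
                network_mb=sum(p.network_mb for p in self.phases),
                disk_io_mb=sum(p.disk_io_mb for p in self.phases),
            )

Section 3 defines the model actually used in this paper; the sketch only conveys the kind of per-resource information the load balancer is assumed to obtain for each application.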

Typical load-balancing approaches that focus on maximizing the utilization of only one type of resource are not sufficient for a wide range of applications. Increasing evidence shows that high-performance computing requires clusters to be capable of simultaneously executing multiple types of applications submitted by users [20], [26], [31]. For this reason, the COM-aware scheme is designed to deliver high performance under a large variety of workload scenarios, providing one possible solution to this problem.
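As a minimal sketch of how such multi-resource information might feed a communication-aware placement decision (again our own illustration; the per-node counters net_load and cpu_load are hypothetical, and the paper's actual algorithm is presented in Section 4), a dispatcher could favor the node with the lightest communication load and break ties by CPU load:

    def select_node(nodes, task):
        # Pick the node whose network is least loaded; break ties on CPU load.
        # Each node carries hypothetical counters: 'net_load' (MB of pending
        # traffic) and 'cpu_load' (queued CPU seconds). The task dict holds
        # demand estimates taken from its behavioral model.
        best = min(nodes, key=lambda n: (n['net_load'], n['cpu_load']))
        best['net_load'] += task['network_mb']  # account for the task's traffic
        best['cpu_load'] += task['cpu_time']    # and for its CPU demand
        return best

    # Example: three nodes and one communication-heavy task.
    nodes = [{'id': 0, 'net_load': 120.0, 'cpu_load': 8.0},
             {'id': 1, 'net_load': 40.0, 'cpu_load': 15.0},
             {'id': 2, 'net_load': 40.0, 'cpu_load': 5.0}]
    task = {'network_mb': 300.0, 'cpu_time': 2.0}
    print(select_node(nodes, task)['id'])  # prints 2: tied net load, lighter CPU

A load index built only from CPU or memory usage would place the task without regard to how congested a node's link is, which is precisely the limitation the COM-aware scheme is designed to avoid.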

The rest of the paper is organized as follows: Section 2 presents a summary of related work. Section 3 introduces an application behavioral model and a system model that capture, respectively, the resource requirements of applications and the simulated cluster environment. Section 4 presents our communication-aware load-balancing scheme. Section 5 introduces the simulation model and the methods used to gather data for performance evaluation. Section 6 summarizes the performance evaluation of the proposed scheme by simulating a cluster running both synthetic bulk-synchronous and real-world communication-intensive parallel applications. Finally, Section 7 summarizes the main contributions of this paper and comments on future research.

References

[1] A. Acharya and S. Setia, "Availability and Utility of Idle Memory in Workstation Clusters," Proc. ACM SIGMETRICS, pp. 35-46, 1999.

[2] N.J. Boden, D. Cohen, R.E. Felderman, A.E. Kulawik, C.L. Seitz, J.N. Seizovic, and W.-K. Su, "Myrinet: A Gigabit-Per-Second Local Area Network," IEEE Micro, vol. 15, no. 1, pp. 29-36, Feb. 1995.

[3] R. Brightwell, B. Lawry, A.B. MacCabe, and R. Riesen, "Portals 3.0: Protocol Building Blocks for Low Overhead Communication," Proc. 16th Int'l Parallel and Distributed Processing Symp. (IPDPS '02), pp. 164-173, 2002.

[4] D. Buntinas, D.K. Panda, and R. Brightwell, "Application-Bypass Broadcast in MPICH over GM," Proc. Third Int'l Symp. Cluster Computing and the Grid (CCGRID '03), pp. 2-9, 2003.

[5] W. Cirne and F. Berman, "When the Herd Is Smart: Aggregate Behavior in the Selection of Job Request," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 2, pp. 181-192, Feb. 2003.

[6] J. Cohen, E. Jeannot, N. Padoy, and F. Wagner, "Messages Scheduling for Parallel Data Redistribution between Clusters," IEEE Trans. Parallel and Distributed Systems, vol. 17, no. 10, pp. 1163-1175, Oct. 2006.

[7] J. Cruz and K. Park, "Towards Communication-Sensitive Load Balancing," Proc. 21st Int'l Conf. Distributed Computing Systems, pp. 731-734, Apr. 2001.

[8] R. Cypher, A. Ho, S. Konstantinidou, and P. Messina, "Architectural Requirements of Parallel Scientific Applications with Explicit Communication," Proc. 20th Ann. Int'l Symp. Computer Architecture (ISCA '93), pp. 2-13, 1993.

[9] A.C. Dusseau, R.H. Arpaci, and D.E. Culler, "Effective Distributed Scheduling of Parallel Workloads," Proc. ACM SIGMETRICS, pp. 25-36, 1996.

[10] W.-C. Feng, J. (Gus) Hurwitz, H. Newman, S. Ravot, R.L. Cottrell, O. Martin, F. Coccetti, C. Jin, X. (David) Wei, and S. Low, "Optimizing 10-Gigabit Ethernet for Networks of Workstations, Clusters, and Grids: A Case Study," Proc. 2003 ACM/IEEE Conf. Supercomputing (SC '03), p. 50, 2003.

[11] P. Geoffray, "OPIOM: Off-Processor I/O with Myrinet," Future Generation Computer Systems, vol. 18, no. 4, pp. 491-499, 2002.

[12] W. Gropp and E. Lusk, "The Message Passing Interface (MPI) Standard," Argonne Nat'l Lab, 2001.

[13] M. Harchol-Balter and A.B. Downey, "Exploiting Process Lifetime Distributions for Dynamic Load Balancing," ACM Trans. Computer Systems, vol. 15, no. 3, pp. 253-285, 1997.

[14] C.H. Hsu and J.W. Liu, "Dynamic Load Balancing Algorithms in Homogeneous Distributed Systems," Proc. Sixth Int'l Conf. Distributed Computing Systems, pp. 216-223, May 1986.

[15] R. Lavi and A. Barak, "The Home Model and Competitive Algorithms for Load Balancing in a Computing Cluster," Proc. 21st Int'l Conf. Distributed Computing Systems (ICDCS '01), pp. 127-134, 2001.

[16] P. Li and D. Curkendall, "Parallel 3-D Perspective Rendering," Proc. First Int'l Delta Applications Workshop, pp. 52-58, 1992.

[17] J.M. Orduna, V. Arnau, A. Ruiz, R. Valero, and J. Duato, "On the Design of Communication-Aware Task Scheduling Strategies for Heterogeneous Systems," Proc. 2000 Int'l Conf. Parallel Processing (ICPP '00), pp. 391-398, 2000.

[18] X. Qin, "Design and Analysis of a Load Balancing Strategy in Data Grids," Future Generation Computer Systems, vol. 23, no. 1, pp. 132-137, 2007.

[19] X. Qin, "Performance Comparisons of Load Balancing Algorithms for I/O-Intensive Workloads on Clusters," J. Network and Computer Applications, vol. 31, no. 1, pp. 32-46, 2008.

[20] X. Qin, H. Jiang, Y. Zhu, and D. Swanson, "Towards Load Balancing Support for I/O-Intensive Parallel Jobs in a Cluster of Workstations," Proc. IEEE Int'l Conf. Cluster Computing, Dec. 2003.

[21] X. Qin, H. Jiang, Y. Zhu, and D. Swanson, "Dynamic Load Balancing for I/O-Intensive Tasks on Heterogeneous Clusters," Proc. 10th Int'l Conf. High Performance Computing (HiPC '03), Dec. 2003.

[22] K.D. Ryu and J.K. Hollingsworth, "Exploiting Fine-Grained Idle Periods in Networks of Workstations," IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 7, pp. 683-698, July 2000.

[23] L. Schaelicke and A.L. Davis, "Design Trade-Offs for User-Level I/O Architectures," IEEE Trans. Computers, vol. 55, no. 8, pp. 962-973, Aug. 2006.

[24] K.G. Shin and Y.-C. Chang, "Load Sharing in Distributed Real-Time Systems with State-Change Broadcasts," IEEE Trans. Computers, vol. 38, no. 8, pp. 1124-1142, Aug. 1989.

[25] J.A. Stankovic, "Simulations of Three Adaptive, Decentralized Controlled, Job Scheduling Algorithms," Computer Networks and ISDN Systems, vol. 8, no. 3, pp. 199-217, 1984.

[26] M. Surdeanu, D.I. Moldovan, and S.M. Harabagiu, "Performance Analysis of a Distributed Question/Answering System," IEEE Trans. Parallel and Distributed Systems, vol. 13, no. 6, pp. 579-596, June 2002.

[27] L.G. Valiant, "A Bridging Model for Parallel Computation," Comm. ACM, vol. 33, no. 8, pp. 103-111, 1990.

[28] J.S. Vetter and F. Mueller, "Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures," J. Parallel and Distributed Computing, vol. 63, no. 9, pp. 853-865, 2003.

[29] J.S. Vetter and F. Mueller, "High Performance Implementation of MPI Datatype Communication over Infiniband," Proc. Int'l Parallel and Distributed Processing Symp., Apr. 2004.

[30] J.S. Vetter and A. Yoo, "An Empirical Performance Evaluation of Scalable Scientific Applications," Proc. 2002 ACM/IEEE Conf. Supercomputing (SC '02), pp. 1-18, 2002.

[31] X.-D. Zhang, L. Xiao, and Y.-X. Qu, "Improving Distributed Workload Performance by Sharing Both CPU and Memory Resources," Proc. 20th Int'l Conf. Distributed Computing Systems (ICDCS '00), pp. 233-241, 2000.