Parallel Solvers for Dense Linear Systems for Heterogeneous Computational Clusters

Ravi Reddy
School of Computer Science and Informatics, University College Dublin
manumachu.reddy@ucd.ie


Alexey Lastovetsky
School of Computer Science and Informatics, University College Dublin
alexey.lastovetsky@ucd.ie


Pedro Alonso
Department of Information Systems and Computation, Polytechnic University of Valencia
palonso@dsic.upv.es

Abstract

This paper describes the design and the implementation of parallel routines in the Heterogeneous ScaLAPACK library that solve a dense system of linear equations. This library is written on top of HeteroMPI and ScaLAPACK, whose building blocks, the de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) and for message-passing communication (BLACS), are optimized for heterogeneous computational clusters.

We show that the efficiency of these parallel routines stems from the most important feature of the library: the automation of the difficult optimization tasks of parallel programming on heterogeneous computing clusters. These tasks are the determination of accurate values of the platform parameters, such as the speeds of the processors and the latencies and bandwidths of the communication links connecting different pairs of processors; the determination of the optimal values of the algorithmic parameters, such as the total number of processes and the 2D process grid arrangement; and the efficient mapping of the processes executing the parallel algorithm to the executing nodes of the heterogeneous computing cluster.

We describe this process of automation and then present experimental results on a local heterogeneous computing cluster demonstrating the efficiency of these solvers.

1. Introduction

This paper describes the design and the implementation of the parallel routines in the Heterogeneous ScaLAPACK library that solve a dense system of linear equations. These routines compute the solution to a (real or complex) system of linear equations A×X=B, where A, X, and B are distributed matrices. We start by presenting related work, which has so far produced heterogeneous parallel algorithms and strategies but little accompanying software.

There are two strategies of distribution of computations that can be used to implement parallel solvers for dense linear systems for Heterogeneous Computing Clusters (HCCs) [1].

  • HeHo - heterogeneous distribution of processes over processors and homogeneous block distribution of data over the processes. This strategy is implemented in the mpC language [2,3] with calls to ScaLAPACK [4];
  • HoHe - homogeneous distribution of processes over processors, with each process running on a separate processor, and heterogeneous block-cyclic distribution of data over the processes. This strategy is implemented in the mpC language with calls to BLAS [4] and LAPACK [4].

The HeHo strategy is a multiprocessing approach commonly used to accelerate ScaLAPACK programs on HCCs. The two strategies are compared in [1] using Cholesky factorization on an HCC. The authors show that, for HCCs, the HeHo and HoHe strategies are more efficient than the traditional homogeneous strategy HoHo (homogeneous distribution of processes over processors, with each process running on a separate processor, and homogeneous block-cyclic distribution of data over the processes, as implemented in ScaLAPACK). The main disadvantage of the HoHe strategy is the non-Cartesian nature of its data distribution, in which each processor has to communicate with more than four direct neighbors. This leads to additional communications that can be crippling in the case of large networks. Moreover, the non-Cartesian nature makes non-scalable those linear algebra algorithms that are proven to be scalable in the case of Cartesian data distributions and networks allowing parallel communications.

Beaumont et al. [5] present a proposal motivating the provision of parallel solvers for dense linear systems in the form of a library for HCCs. The authors discuss HoHe strategies to implement matrix products and dense linear system solvers for HCCs and present them as a basis for a successful extension of the ScaLAPACK library to HCCs. They show that extending the standard ScaLAPACK block-cyclic distribution to heterogeneous 2D grids is a difficult task. However, providing this extension is only the first step. The ScaLAPACK software must then be completely redesigned and rewritten, which is not a trivial effort.

The multiprocessing strategy (HeHo), however, is easier to accomplish. It allows the complete reuse of high-quality software such as ScaLAPACK on HCCs with no redesign efforts and provides good speedups. It can be summarized as follows:

  • The whole computation is partitioned into a large number of equal chunks;
  • Each chunk is performed by a separate process;
  • The number of processes run by each processor is proportional to its speed.

Thus, while distributed evenly across parallel processes, data and computations are distributed unevenly over processors of the HCC so that each processor performs the volume of computations proportional to its speed.

There have been several research contributions illuminating the merits of this strategy. Kishimoto and Ichikawa [6] use it to estimate the best processing element (PE) configuration and process allocation based on an execution-time model of the application. Kalinov and Klimov [7] investigate the HeHo strategy where the performance of a processor is given as a function of the number of processes running on the processor and the amount of data distributed to the processor. They present an algorithm that computes the optimal number of processes and their distribution over processors, minimizing the execution time of the application. Cuenca et al. [8] analyze automatic optimization techniques in the design of parallel dynamic programming algorithms on heterogeneous platforms. The main idea is to automatically determine the optimal values of a number of algorithmic parameters, such as the number of processes, the number of processors, and the number of processes per processor.

So there has been a proliferation of research efforts presenting approaches to implementing parallel solvers for HCCs, but scientific software based on these approaches is virtually nonexistent. From these research efforts, however, one can conclude that any software aspiring to provide automatically tuned parallel linear algebra programs for HCCs must automate the following essential and complicated tasks, also described in [9]:

  • Determination of accurate values of the platform parameters, such as the speeds of the processors and the latencies and bandwidths of the communication links connecting different pairs of processors. These parameters compose the performance model of the executing heterogeneous platform;
  • Determination of the optimal values of the algorithmic parameters, such as the data distribution blocking factor and the 2D process grid arrangement for the linear algebra kernel; and
  • Efficient mapping of the processes executing the parallel algorithm to the executing nodes of the HCC.

The Heterogeneous ScaLAPACK library [10] is one such software package, currently under development, providing automatically tuned parallel linear algebra programs for HCCs. It performs all the aforementioned automations. It uses the multiprocessing strategy (HeHo) and is built on top of HeteroMPI [11] and ScaLAPACK.

In this paper, we describe the efforts made to implement parallel routines solving dense linear systems in this library. We present details of the implementation of one of the solvers, PDGESV, which computes the solution to a real system of linear equations. At the same time, we also explain how the optimal values of the algorithmic parameters such as the data distribution blocking factor and 2D process grid arrangement are determined and how efficient mapping of the processes to the executing nodes of the HCC is accomplished.

The rest of the paper is organized as follows. Section 2 gives an overview of the software in the Heterogeneous ScaLAPACK package and explains how the optimal values of the algorithmic parameters are determined. We then present experimental results from the execution of a couple of Heterogeneous ScaLAPACK programs on a local HCC and conclude the paper by outlining our future research goals.

2. Determination of the Optimal Algorithmic Parameters

The principal routines in the Heterogeneous ScaLAPACK package are the context creation functions for the ScaLAPACK routines. There is a context creation function for every ScaLAPACK routine. It provides a context for the execution of the ScaLAPACK routine and, most importantly, performs the critical work of automating the difficult optimization tasks. All the context creation routines have names of the form hscal_pxyyzzz_ctxt. The context creation/destruction routines call interface functions of the HeteroMPI runtime system (the main routines being HMPI_Recon, HMPI_Timeof, and HMPI_Group_auto_create).
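The typical pattern of use can be sketched as follows. This fragment is schematic: the signatures of the context creation, membership-test, and context destruction calls are simplified assumptions, and the exact API should be taken from the Heterogeneous ScaLAPACK programmer's manual.

```c
/* Schematic sketch of using a Heterogeneous ScaLAPACK context
 * constructor; signatures are hypothetical simplifications. */
int n = 1024, nrhs = 1, ia = 1, ja = 1, ib = 1, jb = 1;
int desca[9], descb[9], ictxt, info;

/* The context constructor measures platform parameters, picks the
 * optimal process grid, and maps processes to the nodes of the HCC. */
hscal_pdgesv_ctxt(&n, &nrhs, &ia, &ja, desca, &ib, &jb, descb, &ictxt);

if (hscal_in_ctxt(&ictxt)) {       /* only processes in the context   */
    /* ... allocate and fill the local parts of A and B, then call the
     * unmodified ScaLAPACK solver on the optimized context ...
     * pdgesv_(&n, &nrhs, a, &ia, &ja, desca, ipiv,
     *         b, &ib, &jb, descb, &info);                             */
    hscal_free_ctxt(&ictxt);       /* destroy the context             */
}
```

The point of the pattern is that the application code calling the ScaLAPACK routine itself is unchanged; only the context in which it executes is optimized.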

The first step in the implementation of the context creation routine for a ScaLAPACK routine is the writing of its performance model using the HeteroMPI's Performance Model Definition Language (PMDL), which is a subset of mpC.

The writing of performance models of all the ScaLAPACK routines solving a dense system of linear equations (PxGESV, PxPOSV) has been the most laborious and intricate effort of this project. The key design issues were (a) accuracy, to facilitate accurate prediction of the execution time of the ScaLAPACK routine; (b) efficiency, to execute the performance model in a reasonable time; (c) reusability, as these performance models are to serve as building blocks for the performance models of other routines; and (d) preservability, to preserve the key design features of the underlying ScaLAPACK package.

The reader is referred to the Heterogeneous ScaLAPACK programmer's manual for more details of the user interface. The complete performance model descriptions of the routines PxGESV and PxPOSV can be found in the files pm_pxgesv.mpc and pm_pxposv.mpc in the directory /SRC of the Heterogeneous ScaLAPACK package available at the URL [10].

The PDGESV context constructor routine hscal_pdgesv_ctxt is the main function automating the difficult optimization tasks of parallel programming on HCCs: the determination of accurate values of the platform parameters, such as the speeds of the processors and the latencies and bandwidths of the communication links connecting different pairs of processors; the determination of the optimal values of the algorithmic parameters, such as the 2D process grid arrangement; and the efficient mapping of the processes executing the parallel algorithm to the executing nodes of the HCC.

...

Acknowledgement

This research was supported by Science Foundation Ireland (SFI). Pedro Alonso wishes to acknowledge the support provided by the Vicerrectorado de Investigación, Desarrollo e Innovación de la Universidad Politécnica de Valencia and by the Generalitat Valenciana.

References

  1. A. Kalinov and A. Lastovetsky, "Heterogeneous Distribution of Computations Solving Linear Algebra Problems on Networks of Heterogeneous Computers," Journal of Parallel and Distributed Computing, 61(4), pp.520-535, April 2001.
  2. A. Lastovetsky, D. Arapov, A. Kalinov, and I. Ledovskih, "A Parallel Language and Its Programming System for Heterogeneous Networks," Concurrency: Practice and Experience, 12(13), pp.1317-1343, November 2000.
  3. A. Lastovetsky, "Adaptive Parallel Computing on Heterogeneous Networks with mpC," Parallel Computing, 28(10), pp.1369-1407, October 2002.
  4. Netlib repository. http://www.netlib.org.
  5. O. Beaumont, V. Boudet, A. Petitet, F. Rastello, and Y. Robert, "A Proposal for a Heterogeneous Cluster ScaLAPACK (Dense Linear Solvers)," IEEE Transactions on Computers, 50(10), pp.1052-1070, October 2001.
  6. Y. Kishimoto and S. Ichikawa, "An Execution-Time Estimation Model for Heterogeneous Clusters," In Proceedings of 18th International Parallel and Distributed Processing Symposium, IEEE Computer Society (2004).
  7. A. Kalinov and S. Klimov, "Optimal mapping of a parallel application processes onto heterogeneous platform," In Proceedings of 19th International Parallel and Distributed Processing Symposium, IEEE Computer Society (2005).
  8. J. Cuenca, D. Giménez, and J-P. Martinez, "Heuristics for work distribution of a homogeneous parallel dynamic programming scheme on heterogeneous systems," Parallel Computing, 31(7), pp.711-735, Elsevier, 2005.
  9. A. Lastovetsky, "Scientific Programming for Heterogeneous Systems - Bridging the Gap between Algorithms and Applications," Proceedings of the 5th International Symposium on Parallel Computing in Electrical Engineering, IEEE Computer Society (2006).
  10. Heterogeneous ScaLAPACK. http://hcl.ucd.ie/project/HeteroScaLAPACK/.
  11. A. Lastovetsky and R. Reddy, "HeteroMPI: Towards a Message-Passing Library for Heterogeneous Networks of Computers," Journal of Parallel and Distributed Computing (JPDC), 66(2), pp.197-220, Elsevier, 2006.