Title: A Class of OpenMP Applications Involving Nested Parallelism
Source: http://portal.acm.org/citation.cfm?id=967948
Authors: H. Martin Bücker, Arno Rasch, Andreas Wolf
Abstract
Today, OpenMP is the de facto standard for portable shared-memory programming supporting multiple levels of parallelism. Unfortunately, most of the current OpenMP implementations are not capable of fully exploiting more than one level of parallelism. With the increasing number of processors available in high-performance computing resources, the number of applications that would benefit from multilevel parallelism is also increasing. Applying automatic differentiation to OpenMP programs is introduced as a new class of OpenMP applications with nested parallelism.
INTRODUCTION
Simple, yet powerful programming models are needed for the effective utilization of parallel computing resources. OpenMP, in which the available parallelism is expressed by inserting directives into a given sequential code, is the most widely used parallel programming model for shared-memory systems. In the OpenMP standard, the spawning of new parallelism from a previously activated parallel region is explicitly supported. Creating multiple levels of parallelism by nesting parallel directives is called multilevel or nested parallelism. Unfortunately, the current OpenMP standard specifies that nested parallelism is optional and, today, most of the existing OpenMP systems do not sufficiently implement nested parallelism.
The reason for the inadequate support of multilevel parallelism in current OpenMP implementations is twofold. First, it is believed that a significant effort is needed to efficiently implement nested parallelism. Second, there is the impression that there is a lack of potential applications involving multilevel parallelism. However, neither argument holds water. There is a considerable body of work demonstrating that an efficient implementation of nested parallelism is not that difficult; see [27] and the references therein. There are also many applications that would benefit from nested parallelism. Most notably, applications that do not have a sufficiently high degree of parallelism on the outermost level are primary candidates for nested parallelism. With the large increase in the number of processing elements in next-generation parallel computers, in which tens of thousands of processors are utilized [1], we expect this class of applications to grow and nested parallelism to become increasingly important in the not-too-distant future. Furthermore, the importance of nested parallelism is not only reflected by the fact that specific language constructs for nested parallelism have been introduced [17]. Multilevel parallelism is also crucial in the context of using parallel libraries when the calling code is itself parallel.
The purpose of this note is to introduce a whole class of applications that would benefit from nested parallelism, rather than just another single application. We show that any program generated by applying automatic differentiation to OpenMP programs involves at least two levels of parallelism. To the best of the authors’ knowledge, this is the first time automatic differentiation of OpenMP code is considered in the open literature.
The structure of this note is as follows. In Sec. 2 and Sec. 3, the basics of OpenMP and automatic differentiation are briefly summarized. The application of automatic differentiation to a code that is parallelized using OpenMP is sketched in Sec. 4. The derivative computations can be parallelized by exploiting an additional level of parallelism. The resulting approach based on nested parallelism is introduced in Sec. 5.
OPENMP
OpenMP [14, 24, 25] is a set of portable compiler directives, library routines, and environment variables that can be used to specify shared-memory parallelism in C/C++ and Fortran programs. The standard, defined by a group of computer hardware and software vendors, is available via http://www.openmp.org. For the sake of brevity, we focus on a very limited subset of the OpenMP directives for Fortran that are relevant for the following discussion. In OpenMP, directives are indicated by C$OMP at the beginning of a line. The directives parallel and end parallel define a so-called parallel region. On entering such a region, additional threads are created that allow parallel execution.
1 C$OMP parallel do private(i,tmp) shared(L,x,y,z)
2       do i=1,L
3          tmp=sin(z(i))
4          x(i)=y(i)*tmp
5       enddo
6 C$OMP end parallel do
Figure 1: Example code with OpenMP directives for loop parallelism
Unless specified with other directives, each of these threads executes the whole code within the parallel region. There are different methods for controlling the number of threads; see [25] for details. The additional threads are (logically) terminated when the directive end parallel is encountered. Within a parallel region the threads can be used not only to perform redundant identical operations but also for sharing the work of expensive computations, in particular of loops. If a loop in a parallel region is enclosed by the worksharing directives do and end do then its iterations are assigned to the threads according to some scheduling strategy. That is, a parallel construct can be combined with a do directive to execute the iterations of a loop in parallel. These two directives can be combined into the single construct parallel do and end parallel do.
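For illustration only (this fragment is not part of the original example), the combined construct of Fig. 1 can equivalently be expressed with separate parallel and worksharing directives; the schedule(static) clause is included here merely as one example of a scheduling strategy:

C     Illustrative sketch: the loop of Fig. 1 written with separate
C     parallel and worksharing directives instead of parallel do.
C$OMP parallel private(i,tmp) shared(L,x,y,z)
C$OMP do schedule(static)
      do i=1,L
         tmp=sin(z(i))
         x(i)=y(i)*tmp
      enddo
C$OMP end do
C$OMP end parallel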
In the context of shared-memory parallelization it is important to distinguish between variables that are shared by all threads, and private variables that require one copy per thread. For example, the loop counter of a parallelized loop must be private because the threads operate on different iterations of the loop, whereas the loop bounds are identical for all threads and therefore may be shared. By default, most of the variables are assumed to be shared, but OpenMP clauses can be used to make some of the variables private. Throughout this note we do not use the default behavior of these scoping clauses but prefer to specify the corresponding scopes, shared and private, explicitly.
To illustrate the above discussion, consider the code fragment depicted in Fig. 1 where the variables x, y and z denote arrays of size L. The parallel do construct indicates that the different iterations can be executed by different threads in parallel. In this example, all the threads within the team have full access to the (global) arrays x, y, and z, but each thread uses its own portion of these arrays. The shared clause is used to specify that a variable can be accessed by the whole team of threads. On the other hand, the private clause indicates that each thread maintains a private copy of the variables i and tmp.
The speedup S of a parallel code computing a function f using p threads is defined by the ratio of the execution time on a single thread, Tf (1), and the execution time on p threads, Tf (p). Amdahl’s law [2] states that the speedup is given by
S(a, p) = 1 / (1 - a + a/p)    (1)
where a is the fraction of the code that can be parallelized. In the code depicted in Fig. 1, there is no sequential part, i.e., a = 1, and a reasonable number of threads should not exceed L.
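As a purely illustrative instance of (1): with a = 0.9 and p = 8 threads, the speedup is limited to S(0.9, 8) = 1/(0.1 + 0.9/8) ≈ 4.7, whereas for the fully parallel loop of Fig. 1 (a = 1) the speedup equals p as long as p does not exceed L.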
Finally, OpenMP supports nested parallelism where a parallel region is contained within another parallel region. Because few current compilers fully implement this feature, there is little practical experience with multilevel parallelism in OpenMP. Exceptions include the research project NANOS [3, 18, 23].
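As a minimal, self-contained sketch (not taken from the paper) of how nested parallelism is requested at run time, the following program enables nesting via the standard OpenMP runtime routine omp_set_nested; alternatively, the environment variable OMP_NESTED can be set. The program name, loop bounds, and the array a are chosen here purely for illustration.

      program nestdemo
      implicit none
      integer i, j
      double precision a(4,8)
C     Enable nested parallelism; whether the inner region is actually
C     executed by a new team depends on the OpenMP implementation.
      call omp_set_nested(.true.)
C$OMP parallel do private(i,j) shared(a)
      do i=1,4
C$OMP    parallel do private(j) shared(a,i)
         do j=1,8
            a(i,j) = dble(i*j)
         enddo
C$OMP    end parallel do
      enddo
C$OMP end parallel do
      print *, a(4,8)
      end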
AUTOMATIC DIFFERENTIATION
Automatic Differentiation (AD) comprises a set of techniques for automatically augmenting a computer program with statements for the computation of derivatives. The AD technology is applicable whenever derivatives of functions given in the form of a high-level programming language, such as Fortran, C, or C++, are required. The reader is referred to the books [19, 26] and the proceedings of AD workshops [5, 15, 20] for details on this technique. In automatic differentiation the program is treated as a—potentially very long—sequence of elementary operations such as addition or multiplication, for which the derivatives are known. Then the chain rule of differential calculus is applied repeatedly, combining these step-wise derivatives to yield the derivatives of the whole program. This mechanical process can be automated, and several AD tools are available for transforming a given code to the new differentiated code. Thus, AD requires little human effort and produces derivatives without truncation error.
In the forward mode of AD, a derivative object is associated with every variable appearing in the original code. Conceptually, if one is interested in derivatives with respect to N scalar input variables, every scalar intermediate variable of the original code is associated with N additional scalar entries in a corresponding derivative object. Thus, the differentiated code consists of the original code, referred to as the function part, and additional statements involving the derivative objects, called the derivative part hereafter. Take line 3 of the code given in Fig. 1 as an example. The corresponding differentiated code for that line is given by
      do j=1,N
         g_tmp(j)=cos(z(i))*g_z(j,i)
      enddo
      tmp=sin(z(i))
where the derivative part consists of the loop over the N entries of the derivative objects, g_tmp and g_z, and the function part is the original statement tmp=sin(z(i)). Hence, the execution time of the differentiated code computing the function f together with its Jacobian Jf can be approximated by
TJf ≈ (1 + cN)Tf    (2)
where c is some constant and Tf is the execution time to compute the function f using the original code. We stress that the iterations of a loop over derivative objects are independent of each other.
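As a purely illustrative instance of (2), a gradient with N = 20 entries and c ≈ 1 yields TJf ≈ 21 Tf, so the derivative loops account for the vast majority of the execution time; this observation motivates parallelizing them.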
APPLYING AD TO OPENMP–PARALLELIZED CODE
Despite the urgent need for accurate derivatives in large-scale applications on parallel computers, there is little work on automatic differentiation regarding parallelism. A good starting point is the survey [11]. Given a sequential code to which AD is applied, parallelization of the derivative part is considered in [4, 6, 7, 8, 16]. In particular, two different strategies using OpenMP are given in [9, 10].
1 C$OMP parallel do private(i,j,tmp,g_tmp)
2 C$OMP& shared(L,N,x,y,z,g_x,g_y,g_z)
3       do i=1,L
4          do j=1,N
5             g_tmp(j)=cos(z(i))*g_z(j,i)
6          enddo
7          tmp=sin(z(i))
8          do j=1,N
9             g_x(j,i)=g_y(j,i)*tmp + y(i)*g_tmp(j)
10         enddo
11         x(i) = y(i)*tmp
12      enddo
13 C$OMP end parallel do

Figure 2: Augmented code obtained by applying AD to the example code from Sec. 2
However, the application of AD to programs that are already parallel is still in its infancy. AD of message-passing programs is considered in [12, 13, 21, 22]. So far, there is no publication on applying AD to OpenMP programs. The code given in Fig. 2 results from applying AD to the example OpenMP code shown in Fig. 1. The function part consists of lines 7 and 11, while the additional statements in lines 4–6 and 8–10 perform the derivative computations. Note that the private and shared clauses in lines 1 and 2 also include the new variables related to the derivatives. Obviously, the parallelization strategy of the augmented program is similar to that of the original code in Fig. 1. The derivative objects g_x, g_y, and g_z are shared like the corresponding variables in the original code. Since tmp is private in the original code, the corresponding derivative object g_tmp must also be private in the differentiated code. Clearly, the iteration variable j of the derivative loops must be private as well. Using (2) together with the definitions from Sec. 2, the execution time of the differentiated code utilizing p parallel threads is given by
TJf (p) = (1 - a)(1 + cN)Tf (1) + a (1 + cN)Tf (1) / p
yielding the same potential speedup as given in (1). Here, a denotes the fraction of the differentiated code that corresponds to the parallelized parts of the original code. Currently, no AD tool supports OpenMP directives, i.e., they are treated as comments. Hence, lines 1 and 2 had to be adapted manually. Since the scoping of the newly introduced variables into private and shared is straightforward, there is no reason why AD tools should not be capable of handling programs with OpenMP directives. Nevertheless, special care is necessary for parallel constructs like reduction operations [22].
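To illustrate the reduction issue, consider the following hypothetical fragment, which is not taken from the paper; the variables s and g_s are introduced purely for illustration, and the sketch assumes an OpenMP Fortran implementation that permits array variables such as g_s in a reduction clause. The original loop accumulates a scalar s, and a forward-mode AD tool would have to place the associated derivative object g_s in the reduction clause as well:

C     Original loop with a scalar reduction variable s (hypothetical).
C$OMP parallel do private(i) shared(L,x,y) reduction(+:s)
      do i=1,L
         s = s + x(i)*y(i)
      enddo
C$OMP end parallel do

C     Differentiated loop: the derivative object g_s must also be
C     treated as a reduction variable (sketch, not generated by a tool).
C$OMP parallel do private(i,j) shared(L,N,x,y,g_x,g_y)
C$OMP& reduction(+:s,g_s)
      do i=1,L
         do j=1,N
            g_s(j) = g_s(j) + g_x(j,i)*y(i) + x(i)*g_y(j,i)
         enddo
         s = s + x(i)*y(i)
      enddo
C$OMP end parallel do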
USING NESTED PARALLELISM FOR DERIVATIVE COMPUTATIONS
Based on the observation in Sec. 3 that the iterations of the derivative loops are independent of each other (and hence parallel), we suggest the following strategy for increasing the level of parallelism. The strategy is illustrated using the same example as in the previous sections. The code resulting from applying this approach is displayed in Fig. 3.
1 C$OMP parallel do private(i,j,tmp,g_tmp)
2 C$OMP& shared(L,N,x,y,z,g_x,g_y,g_z)
3       do i=1,L
4 C$OMP parallel do private(j)
5 C$OMP& shared(N,i,g_tmp,z,g_z)
6          do j=1,N
7             g_tmp(j)=cos(z(i))*g_z(j,i)
8          enddo
9 C$OMP end parallel do
10         tmp=sin(z(i))
11 C$OMP parallel do private(j)
12 C$OMP& shared(N,i,g_x,y,g_y,tmp,g_tmp)
13         do j=1,N
14            g_x(j,i)=g_y(j,i)*tmp+y(i)*g_tmp(j)
15         enddo
16 C$OMP end parallel do
17         x(i) = y(i)*tmp
18      enddo
19 C$OMP end parallel do
Figure 3: AD code with nested parallelism
The general structure of this code is similar to the one depicted in Fig. 2. Now, each thread t, executing iteration i_t of the outer loop, spawns a new team of threads for performing the derivative computations in the inner loops, which is indicated by the parallel regions within the outer parallel region; see lines 4–9 and lines 11–16 for two instances of nested parallelism. In general, the strategy consists of inserting parallel do and end parallel do directives for each derivative loop.
Although the scoping of the variables in this code with nested parallelism is quite similar to Fig. 2, there are important differences. Take the variables g_tmp and i as an example. The derivative object g_tmp and the outer loop index variable i have private scope in the outer loop, while they are shared by the team of threads spawned by t in the inner loop. The subsequent function evaluations (lines 10 and 17) are performed solely by thread t. This way, multiple teams of threads can work on the derivative loops simultaneously, where each team shares a private copy of g_tmp and i and each thread operates on its own portion of the globally shared derivative objects g_x, g_y, and g_z. If the evaluation of the derivatives is computationally expensive, i.e., long gradients are to be computed, and the degree of parallelism in the outer loop is poor, i.e., the number of available processors is greater than L, more threads could be efficiently employed using the above method. Moreover, all derivative computations can be executed in parallel throughout the whole code, including the serial parts in the original code.
Suppose that p = p1 • p2 threads are available where p1 denotes the number of threads sharing the work in the parallel region specified in the original OpenMP code and p2 denotes the size of one team of threads sharing the work in the derivative loops. Then, the time of the derivative part is divided by p2 throughout the whole code. The execution time of the di_erentiated code is additionally reduced by using p1 threads in those parts which were already parallelized in the original code. Thus, the execution time for evaluating f and its Jacobian reduces to
TJf (p1, p2) = (1 - a)(1 + cN/p2)Tf (1) + a(1 + cN/p2)Tf (1)/p1.
Hence the potential speedup employing p1 • p2 threads in parallel is given by
S(a, p1, p2) = TJf (1) / TJf (p1, p2) = (1 + cN) / ((1 - a)(1 + cN/p2) + a(1 + cN/p2)/ p1 )
To illustrate the potential improvement of nested parallelism over an approach using a single level of parallelism, consider the case where the derivative part dominates the total execution time of the differentiated code. That is, we assume 1 ≪ cN/p2, neglecting the function part. Then, the speedup reduces to
S(a, p1, p2) = 1 / ((1 - a)/p2 + a/(p1 p2)).
Compared to (1), this speedup is larger if p2 > 1. More importantly, recall that a reasonable number of threads using a single level of parallelism, i.e., p = p1, should not exceed the number of loop iterations, L. Employing two-level parallelism, however, enables the utilization of p = p1 • p2 threads with p1 ≤ L and p2 ≤ N.
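As a purely illustrative example, suppose L = 4, N ≥ 4, a = 1, and p = 16 threads are available. A single level of parallelism can usefully employ only p1 = 4 threads and is therefore limited to a speedup of 4, whereas the nested scheme with p1 = 4 and p2 = 4 attains S(1, 4, 4) = p1 • p2 = 16 in the regime where the derivative part dominates.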
Finally, we would like to point out that AD tools typically have access to all the information that is necessary in order to generate the additional OpenMP directives for nested parallel execution of the derivative loops. As explained in Sec. 4, AD tools should also be able to perform the required modifications of the OpenMP directives already existing in the original code. Consequently, AD tools can potentially transform a given parallel code into a parallel differentiated code in which the degree of parallelism is increased with respect to the original code. This way, the parallelization strategy of the original program is retained in the differentiated code, and the extra work caused by the additional derivative statements could potentially be divided among multiple threads in another level of parallelism.
CONCLUDING REMARKS
Applying automatic differentiation to OpenMP programs is identified as a new class of OpenMP applications involving two-level parallelism in a completely automatic way. This semantic program transformation preserves the parallelization strategy of the original program in the differentiated code. Consequently, the parallel scalability of the original and differentiated OpenMP programs is similar. Moreover, the additional derivative computations in the differentiated code introduce another level of parallelism in a natural way, making it possible to utilize more processing elements and thus potentially save some of the extra execution time required for evaluating the derivatives. The necessary insertions and modifications of the OpenMP directives could in principle be performed by an AD tool along with the code transformation related to the derivatives, generating ready-to-use parallel differentiated code without any additional work done by the user.
Acknowledgments
The authors would like to thank Dieter an Mey, Center for Computing and Communication, Aachen University, for fruitful discussions regarding the OpenMP standard. This work has been performed with funding by the Deutsche Forschungsgemeinschaft in the Collaborative Research Centre SFB 401 “Flow Modulation and Fluid-Structure Interaction at Airplane Wings” of Aachen University, Aachen, Germany.
REFERENCES
[1] N. R. Adiga et al. An overview of the BlueGene/L supercomputer. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, Baltimore, MD, USA, November 16–22, 2002.
[2] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the AFIPS Spring Joint Computer Conference, volume 30, pages 483–485, Atlantic City, NJ, 1967.
[3] E. Ayguadé, X. Martorell, J. Labarta, M. González, and N. Navarro. Exploiting multiple levels of parallelism in OpenMP: A case study. In Proceedings of the International Conference on Parallel Processing, ICPP 1999, Wakamatsu, Japan, September 21–24, 1999, pages 172–180. IEEE Computer Society, 1999.
[4] J. Benary. Parallelism in the reverse mode. In Berz et al. [5], pages 137–148.
[5] M. Berz, C. Bischof, G. Corliss, and A. Griewank, editors. Computational Differentiation: Techniques, Applications, and Tools. SIAM, Philadelphia, 1996.
[6] C. Bischof, A. Griewank, and D. Juedes. Exploiting parallelism in automatic differentiation. In E. Houstis and Y. Muraoka, editors, Proceedings of the 1991 International Conference on Supercomputing, pages 146–153, New York, 1991. ACM Press.
[7] C. H. Bischof. Issues in parallel automatic differentiation. In A. Griewank and G. Corliss, editors, Automatic Differentiation of Algorithms, pages 100–113, Philadelphia, PA, 1991. SIAM.
[8] H. M. Bücker, K. R. Buschelman, and P. D. Hovland. A Matrix-Matrix Multiplication Approach to the Automatic Differentiation and Parallelization of Straight-Line Codes. In U. Brinkschulte, K.-E. Großpietsch, C. Hochberger, and E. W. Mayr, editors, Workshop Proceedings of the International Conference on Architecture of Computing Systems ARCS 2002, Germany, April 8–12, 2002, pages 203–210, Berlin, 2002. VDE Verlag.
[9] H. M. Bücker, B. Lang, D. an Mey, and C. H. Bischof. Bringing Together Automatic Differentiation and OpenMP. In Proceedings of the 15th ACM International Conference on Supercomputing, Sorrento, Italy, June 17–21, 2001, pages 246–251, New York, 2001. ACM Press.
[10] H. M. Bücker, B. Lang, A. Rasch, C. H. Bischof, and D. an Mey. Explicit Loop Scheduling in OpenMP for Parallel Automatic Differentiation. In J. N. Almhana and V. C. Bhavsar, editors, Proceedings of the 16th Annual International Symposium on High Performance Computing Systems and Applications, Moncton, NB, Canada, June 16–19, 2002, pages 121–126, Los Alamitos, CA, 2002. IEEE Computer Society.
[11] C. H. Bischof and P. D. Hovland. Automatic Differentiation: Parallel Computation. In C. A. Floudas and P. M. Pardalos, editors, Encyclopedia of Optimization, volume I, pages 102–108. Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.
[12] A. Carle. Automatic Differentiation. In J. Dongarra, I. Foster, G. Fox, W. Gropp, K. Kennedy, L. Torczon, and A. White, editors, Sourcebook of Parallel Computing, pages 701–719. Morgan Kaufmann, San Francisco, CA, 2003.
[13] A. Carle and M. Fagan. Automatically Differentiating MPI-1 Datatypes: The Complete Story. In Corliss et al. [15], pages 215–222.
[14] R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, and R. Menon. Parallel Programming in OpenMP. Morgan Kaufmann Publishers, San Francisco, CA, 2001.
[15] G. Corliss, C. Faure, A. Griewank, L. Hascoët, and U. Naumann, editors. Automatic Differentiation of Algorithms: From Simulation to Optimization. Springer, New York, 2002.
[16] H. Fischer. Automatic differentiation: Parallel computation of function, gradient and Hessian matrix. Parallel Computing, 13:101–110, 1990.
[17] J. A. González, C. León, C. Rodríguez, and F. Sande. Exploiting nested independent FORALL loops in distributed memory machines. In Proceedings of the Third European Workshop on OpenMP – EWOMP’01, Barcelona, Spain, September 8–9, 2001, 2001.
[18] M. González, E. Ayguadé, X. Martorell, J. Labarta, N. Navarro, and J. Oliver. NanosCompiler: Supporting flexible multilevel parallelism in OpenMP. Concurrency: Practice and Experience, 12(12):1205–1218, 2000.
[19] A. Griewank. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, Philadelphia, 2000.
[20] A. Griewank and G. Corliss. Automatic Differentiation of Algorithms. SIAM, Philadelphia, 1991.
[21] P. Hovland. Automatic Differentiation of Parallel Programs. PhD thesis, University of Illinois at Urbana-Champaign, Urbana, IL, USA, 1997.
[22] P. D. Hovland and C. H. Bischof. Automatic differentiation of message-passing parallel programs. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pages 98–104, Los Alamitos, CA, 1998. IEEE Computer Society Press.
[23] H. Jin, G. Jost, J. Yan, E. Ayguade, M. Gonzalez, and X. Martorell. Automatic multilevel parallelization using OpenMP. In Proceedings of the Third European Workshop on OpenMP – EWOMP’01, Barcelona, Spain, September 8–9, 2001, 2001.
[24] OpenMP Architecture Review Board. OpenMP C and C++ Application Program Interface, Version 1.0, 1998.
[25] OpenMP Architecture Review Board. OpenMP Fortran Application Program Interface, Version 2.0, 1999.
[26] L. B. Rall. Automatic Differentiation: Techniques and Applications, volume 120 of Lecture Notes in Computer Science. Springer Verlag, Berlin, 1981.
[27] Y. Tanaka, K. Taura, M. Sato, and A. Yonezawa. Performance evaluation of OpenMP applications with nested parallelism. In S. Dwarkadas, editor, Languages, Compilers, and Run-Time Systems for Scalable Computers: Proceedings of 5th International Workshop, LCR 2000, Rochester, NY, USA, May 25-27, 2000, volume 1915 of Lecture Notes in Computer Science, pages 100–112, Berlin, 2000. Springer.