The next wave in high performance computing is superclusters. By combining commodity off-the-shelf parts with a high bandwidth, low latency interconnect, the Albuquerque High Performance Computing Center at The University of New Mexico created a high performance computer at a fraction of the cost of a traditional supercomputer. The Center, along with the National Computational Science Alliance, has proven the concept with its open production Linux supercluster, Roadrunner, and is currently designing a new supercluster with at least 512 processors. Before the implementation of Roadrunner, it was unknown how such a machine would perform in a production environment. In addition, the National Computational Science Alliance and others are expected to propose building Terascale Linux superclusters with thousands of nodes. The Center is also participating in several research projects, including a collaboration with IBM called Vista Azul, to analyze and improve the capabilities and performance of these superclusters. With both production and research superclusters, there are opportunities to study and apply lessons to improve each environment. These new superclusters are integrated nationally and internationally through the Globus Grid Infrastructure and the Virtual Machine Room.
The Albuquerque High Performance Computing Center (AHPCC), along with the National Computational Science Alliance (the Alliance), has explored the future of high-end computing by implementing several production and research Linux superclusters. With Roadrunner, a 128-processor (dual SMP nodes) machine with a high bandwidth, low latency interconnect, a production resource for the Alliance was created. After this success, a 512-processor supercluster is being built to satisfy the scientific and engineering needs of U.S. National Science Foundation (NSF) researchers. Along with these clusters for production computing, the AHPCC has entered into a research partnership with IBM and others to explore how to create better clusters for computation and visualization. With the advent of computational grids such as the Globus Grid Infrastructure and the Virtual Machine Room, Linux superclusters are becoming a national and international resource for scientific computing.
Over the next decade, clusters will span the entire spectrum of high-performance computing platforms. Instead of being restricted solely to supercomputers for achieving high-performance computing, the spectrum of available platforms will extend from Beowulf-class clusters to superclusters, and on to clusters of supercomputers. Moving along this continuum represents increasing computational power, interconnect speed, and possibly other resources, which in turn allows the capability for solving larger classes of applications or instances of problems. Compute-bound applications on workstations often take advantage of the extra horsepower provided by symmetric multiprocessors. Applications with even larger computational requirements and a high degree of concurrency may use Beowulf clusters, that is, clusters that are home-built from mass-market, commercial off-the-shelf (COTS) workstations and networks.
An example of a Beowulf cluster is a collection of PCs interconnected by Fast Ethernet. Excellent results for some of these application programs have been shown using the Condor [1] software and management tools on such clusters.
A supercluster may contain commodity SMP nodes interconnected with scalable, low-latency, high-bandwidth networks, and is generally integrated by a system vendor rather than by the computational scientist. Applications with greater complexity, for example, grand challenges from the sciences, demand scalable systems with high-performance interconnects, perhaps coupled with Terascale I/O subsystems, and thus are appropriate for these superclusters rather than simply Beowulf clusters [7]. Finally, supercomputers, the largest of the clusters, may provide the ultimate platform for the highest-fidelity models and simulations and for data-intensive applications. At each step up this ladder of cluster architectures, the system costs increase approximately by a factor of two or three, but having the high-performance capability to solve the problem of interest may require an entry point at any of these levels, namely at the Beowulf, supercluster, or even supercomputer platform.
Cluster computing with Beowulf clusters or superclusters may be three or four orders of magnitude less expensive in hardware than a supercomputer, yet still provide a cost-effective solution and a capability that was not available on a workstation. But this cost comparison measures just the basic system price. What about the applications? As recently as a few years ago, the lack of a widely available, agreed-upon standard for a high-performance architectural paradigm meant that migrating to a new computer system, even from the same vendor, typically required redesigning applications to efficiently use the high-performance system. Thus, the application had to be redesigned with parallel algorithms and libraries that were optimized for each high-performance system. This process of porting, redesigning, optimizing, debugging, and analyzing the application may in fact be prohibitively expensive, and certainly frustrating when the process is completed just in time to greet the next-generation system! With the successful standardization of system-level interfaces, many of these issues no longer hinder our expedited use of these new, high-performance clusters.
The U.S. National Science Foundation (NSF) currently supports two large advanced computing partnerships: the National Computational Science Alliance (the Alliance), based at the University of Illinois at Urbana-Champaign, and the National Partnership for Advanced Computational Infrastructure (NPACI), based at the University of California, San Diego. Traditionally these centers have purchased large, expensive parallel machines such as the SGI Origin 2000 and IBM SP. Researchers who require large computational resources can request time on these machines.
The dilemma is that these machines are so heavily utilized that jobs often sit in the queue for weeks before they are able to run, and quite understandably, many researchers find this delay unacceptable. It is also the case that many applications do not require the complex architectural features available on these large, expensive, parallel machines. For example, the SGI Origin has the ability to provide non-uniform shared memory, but this feature may not be required, or worse, may adversely affect the performance of codes that have not been fully tuned. In addition, several classes of applications tend to be computationally demanding with moderate or minimal communication. Examples include Monte Carlo simulations, computational fluid dynamics problems, and lattice computation problems used in computational physics. Here, Beowulf clusters or superclusters are the cost-effective compute platform.
To address these needs, scientists at the Albuquerque High Performance Computing Center, a center of the High Performance Computing, Education and Research Center (HPCERC), a Strategic Center within The University of New Mexico (UNM), together with the National Computational Science Alliance (the Alliance), established the first NSF-funded Linux supercluster at UNM using PC-class (Intel-based) hardware, a high-performance interconnection network, and the Linux operating system. This production cluster costs a fraction of the price of a commercial supercomputer and performs well for many classes of applications. The NSF is also funding a 512/1024-processor machine, also to be located at UNM. The Alliance is additionally proposing to NSF a Terascale Linux supercluster, to be located at NCSA in Urbana-Champaign, Illinois.
What is a Supercluster?
Currently there are many commodity clusters being built using mass-market, commercial off-the-shelf (COTS) parts. Typically such clusters involve purchasing or ordering the necessary components, such as PCs with Fast Ethernet NICs and a hub or a switch to connect them together. Then comes the installation of the operating system as well as the necessary parallel processing software packages. These types of clusters are often classified as "Beowulf systems" [2]. Our supercluster design, on the other hand, follows a different principle and does not require computational scientists also to be expert computer hobbyists. Superclusters, while PC-based, are integrated (hardware and software) by a vendor and may have additional supercomputing-like hardware, such as a high-performance network, hardware monitors, or large disk or parallel storage, that improves scalability. In this manner, a supercluster may support the scalable execution of a greater number of applications than a Beowulf-class machine.
Hardware alone isn't enough to build a low-cost supercomputer: software to run parallel jobs and schedulers to manage the nodes are needed as well. Much open source software is available, including compilers, message passing implementations, schedulers, and computational grid infrastructure software.
AHPCC Projects: Roadrunner, Terascale Computing, and Vista Azul
The AHPCC is currently involved in several supercluster projects. The first, Roadrunner, explored turning a commodity off-the-shelf PC cluster into a production supercluster. Linux, message passing, job scheduling (the Portable Batch System and the Maui Scheduler [5]), and compilers (EGCS and PGI) are needed to complete the transformation of a standard cluster of PCs into a parallel machine. Application performance (CACTUS [6], Direct Numerical Simulation of Turbulence [8]) demonstrates the high performance nature of the supercluster. And computational grids (Globus [9] and the Virtual Machine Room) provide greater access for the research scientist. The next evolutionary step for superclusters is a national strategic direction for Terascale computing. A project called Vista Azul is performing research to study and improve supercluster performance.
The Roadrunner Project
The motivation behind Roadrunner is "capability computing," where users intend to heavily use one or more of the available computational resources and to continuously execute applications over a long period of time (on the order of days to weeks). We are attempting to solve grand-challenge class problems that require large amounts of resources and scalable systems unavailable in Beowulf PC clusters. What the Roadrunner project demonstrates is a well-packaged, scalable cluster computer system that gives excellent computational power per dollar.
The Roadrunner supercluster arrived as a unit and came pre-installed with Linux, a programming environment, and custom disk cloning software that allows a system administrator simply to configure a single node and then clone it to the rest of the machine. This approach means the system as a whole is easier to get running initially and to maintain into the future.
Roadrunner operates as an open production computing resource for the Alliance, through the Globus Grid Infrastructure. There are over 100 users of this system running a wide variety of scientific codes.
Roadrunner System Hardware
The UNM/Alliance Roadrunner supercluster is an Alta Technology Corporation 64-node SMP AltaCluster containing 128 Intel 450 MHz Pentium II processors. The supercluster runs the 2.2.14 Linux kernel in SMP mode, with communication between nodes provided via a high-speed Myrinet network (full-duplex 1.2 Gbps) or Fast Ethernet (100 Mbps). The AltaCluster is also outfitted with an external monitoring network that allows supercomputing-like features: individual processor temperature monitoring, power cycling, and resetting of individual nodes.
Each node contains components similar to those in a commodity PC, for instance, in the case of Roadrunner, a 100 MHz system bus, a 512 KB L2 cache, 512 MB of ECC SDRAM, and a 6.4 GB hard drive. However, the nodes do not require video cards, keyboards, mice, or displays.
The Myrinet topology consists of four octal 8-port SAN switches (M2M-OCT-SW8 Myrinet-SAN), each with 16 front-ports attached to 16 nodes and eight SAN back-ports attached to the other switches. The Fast Ethernet uses a 72-port Foundry switch with Gigabit Ethernet uplinks to the vBNS and the Internet.
Roadrunner System Software
There are a number of core components that we have identified for running a production supercluster like Roadrunner. As discussed in previous sections, the first is a robust UNIX-like operating system. After that, the only add-ons are programs that allow one to use the cluster as a single, parallel resource. These include MPI libraries, a parallel job scheduler, application development software, and a programming environment.
Why Linux?
Linux was chosen as a base for the supercluster for several reasons. The scientific community has a history of success with freely available applications and libraries and has found that development is improved when source code is openly shared. Application codes are generally more reliable because of the multitude of researchers studying, optimizing, and debugging the algorithms and code base. This is very important in high performance computing. When researchers encounter performance problems in a commercial computing environment, they must rely on the vendor to fix problems that may impede progress.
Free operating systems have recently matured to the point where they can be relied upon in a production environment. Even traditional HPC vendors like IBM and SGI are now supporting Linux. This is clearly the direction in which the world of high performance computing is moving.
The particular version of Linux we are running on the supercluster is the VA Linux Systems Red Hat Linux distribution, version 6.1, with a custom-compiled SMP kernel, 2.2.14. It is important to note that our system would work equally well under another Linux distribution or another free OS altogether, such as FreeBSD or NetBSD.
There are several other reasons for using Linux: low to no cost, SMP support that performs well in practice (and is continuously improving), and a wealth of commercial drivers and applications that are easily ported or written for Linux. Ease of development is another very important factor in choosing Linux. Because of the availability of Linux, scientists may develop algorithms, code, and applications on their own locally available machines and then run on Roadrunner with little or no porting effort. Nearly all of the software used on Roadrunner is freeware, making it very easy for other groups to set up a similar system for development at their home site.
Message Passing Interface (MPI): MPICH
MPI provides a portable, parallel message-passing interface. On Roadrunner we are using MPICH, a freely available, portable implementation of the full MPI specification developed by Argonne National Laboratory [3]. We chose MPICH over the commercially available MPI implementations for several reasons. MPICH has been very successful and is essentially the de facto standard among freely available MPI implementations, largely because it is the reference implementation of the MPI standard. Commercial MPI implementations also tend to be operating-system dependent. MPICH, on the other hand, is available in source code form and is highly portable to all the popular UNIX operating systems, including Linux.
Two general distributions of MPICH are available on Roadrunner. First, the standard Argonne release of MPICH is available for those using the Fast Ethernet interface. Also available is MPICH over GM, which uses the Myrinet interface; MPICH-GM is distributed by Myricom. Within each of these distributions, two builds of MPICH are available based upon the back-end compilers: one set uses the freely available GNU project compilers, and the second uses the commercial Portland Group suite of compilers. Thus, the user has four flavors of MPICH to choose from on Roadrunner. All MPICH distributions on Roadrunner are configured to use the Secure Shell (SSH) remote login program for initial job launch. SSH is a more secure alternative to RSH and does not compromise performance.
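To make the programming model concrete, the following is a minimal MPI example in C; it is a generic sketch rather than code from Roadrunner, and it assumes an MPICH installation whose mpicc and mpirun wrappers are on the user's path.

    /* Minimal MPI example (generic sketch, not a Roadrunner code):
     * rank 0 sends a short message to rank 1, which echoes it back.
     * Only MPI-1 calls are used. */
    #include <mpi.h>
    #include <stdio.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        char msg[64];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (size < 2) {
            if (rank == 0)
                fprintf(stderr, "run with at least 2 processes\n");
        } else if (rank == 0) {
            strcpy(msg, "hello from rank 0");
            MPI_Send(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(msg, sizeof(msg), MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            printf("rank 0 received echo: %s\n", msg);
        } else if (rank == 1) {
            MPI_Recv(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(msg, sizeof(msg), MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }

Such a program would typically be compiled with mpicc and launched with mpirun -np 2, running over Fast Ethernet with the standard build or over Myrinet with the MPICH-GM build.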
Parallel Job Scheduling
It is also very helpful in a clustered environment to deploy some type of parallel job scheduling software. Users running a parallel application know beforehand that there is a finite amount of resources, such as memory and CPU, available for their application, so they naturally want to use all of these resources. In this capability-computing environment, it is not sufficient to rely solely on a friendly-user policy. Many of the codes on Roadrunner are very time consuming, and the risk of job conflict is therefore great. This risk is compounded further as the cluster size increases and as the machine accommodates more users. Parallel job scheduling software eliminates these types of conflicts and at the same time enforces the particular site's usage policy. Two different schedulers run on Roadrunner: the Portable Batch System and the Maui Scheduler [5].
Portable Batch System (PBS)
For parallel job scheduling, Roadrunner uses the Portable Batch System (PBS), which was developed at NASA Ames Research Center [4]. PBS is a job scheduler that supports management of both interactive and batch-style jobs. The software is configured and compiled using the popular GNU autoconf mechanisms, which makes it quite portable, including to Linux. PBS is freely distributed in source-code form under a software license that permits a site to use PBS locally but does not allow a site to redistribute it. PBS has some nice features, including built-in support for MPI, process accounting, and an optional GUI for job submission, tracking, and administration.
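As a small illustration of how a running job sees its PBS allocation, the sketch below (a generic example, not part of the Roadrunner software) reads the node list that PBS publishes to each job through the PBS_NODEFILE environment variable; launchers consult this list to determine which processors a job has been given.

    /* Generic sketch: enumerate the processors PBS allocated to the current
     * job by reading the file named in the PBS_NODEFILE environment variable
     * (one entry per allocated processor; node names repeat for SMP nodes). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *path = getenv("PBS_NODEFILE");  /* set by PBS inside a job */
        if (path == NULL) {
            fprintf(stderr, "PBS_NODEFILE not set; run inside a PBS job\n");
            return 1;
        }

        FILE *fp = fopen(path, "r");
        if (fp == NULL) {
            perror("fopen");
            return 1;
        }

        char line[256];
        int nprocs = 0;
        while (fgets(line, sizeof(line), fp) != NULL) {
            line[strcspn(line, "\n")] = '\0';       /* strip trailing newline */
            printf("slot %d: %s\n", nprocs++, line);
        }
        fclose(fp);

        printf("processors allocated to this job: %d\n", nprocs);
        return 0;
    }

MPI launch scripts typically consume this same node file when starting processes across a job's allocated nodes.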
Virtual Machine Room
From Maui to Boston, the Alliance Partners for Advanced Computational Services (PACS) have developed the infrastructure to create a Virtual Machine Room. This project links many of the Alliance's high performance and high throughput systems together through a web interface. Users can submit jobs and monitor them through this easy-to-use software. This sophisticated system of authentication, directory, accounting, file, and scheduling services is seamlessly integrated.
Superclusters and Terascale Computing: A National Strategic Direction
HPCERC, and its two centers, the AHPCC and the Maui High Performance Computing Center (MHPCC) are part of the Alliance led by Dr. Larry Smarr at The University of Illinois at Urbana-Champaign (UIUC). The Alliance provides production computing resources for NSF researchers and collateral programs. The Alliance represents a partnership to prototype an advanced computational infrastructure for the 21st century and includes more than 50 academic, government and industry research partners from across the United States. The Alliance is one of two partnerships funded by the National Science Foundation's Partnerships for Advanced Computational Infrastructure (PACI) program and receives cost-sharing at partner institutions. The second partnership is the National Partnership for Advanced Computational Infrastructure (NPACI), led by the San Diego Supercomputer Center.
The successful deployment of the 128-processor IA-32 Roadrunner supercluster, an Alliance resource, at AHPCC helped to solidify the Alliance's strategic computing direction. Roadrunner is currently heavily used by researchers nationwide. Due to this success, the Alliance has decided to fund a similar supercluster consisting of at least 512, and perhaps as many as 1024, processors. This supercluster will have the high bandwidth, low latency Myrinet interconnect as well as a high performance storage system. This new machine is part of the Alliance supercluster strategy: to make cluster computing an essential and integrated part of new high-end capability computing for U.S. academic researchers, business, and industry. The acquisition and management of this new system is a joint activity inside the Alliance, with strong participation by the National Center for Supercomputing Applications (NCSA at UIUC) and the other Alliance partners.
This new supercluster represents the first known deployment of a supercluster of this size for open production computing. The user base will be academic research users from across the country representing National Science Foundation projects. Specific and broadly representative application codes, such as those represented by the Alliance Application Technology (AT) teams, will be run and benchmarked on this compute platform. Of critical importance is developing the management tools to provide supercluster computing as a reliable, stable, and scalable platform for production capability computing. As such, researchers from the Alliance Enabling Technology (ET) teams are expected to deploy their newest tools on this cluster. Especially important is the expected deployment of the UNM Maui Scheduler to manage the flow of jobs on this computer.
Conclusion
This series of production and research projects, including Roadrunner, the move to a Terascale supercluster, and Vista Azul, has helped to improve the performance and use of superclusters. By combining commodity off-the-shelf parts with a high bandwidth, low latency interconnect and open source software (Linux), a high performance computer can be created. Superclusters fill a special niche by providing high performance computing at a reduced cost. With both production and research superclusters, lessons learned in each unique environment were reapplied to benefit all of the projects. With the maturing of computational grids, superclusters provide an excellent resource for national and international scientists.
Acknowledgements
Many have contributed through their efforts and support. We acknowledge many discussions with Brian T. Smith, John Sobolewski, David A. Bader, A. Barney Maccabe and the HPC Systems Group at AHPCC.
References
[1] The Condor Project. www.cs.wisc.edu/condor.
[2] D. Becker, T. Sterling, D. Savarese, J. Dorband, U. Ranawak, and C. Packer. Beowulf: A Parallel Workstation for Scientific Computation. In Proceedings of the 1995 International Conference on Parallel Processing, volume 1, pages 11-14, August 1995.
[3] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. Technical report, University of Tennessee, Knoxville, TN, June 1995. Version 1.1.
[4] MRJ Inc. The Portable Batch System (PBS). pbs.mrj.com.
[5] Maui Scheduler. www.mhpcc.edu
[6] G. Allen, T. Goodale, and E. Seidel. The Cactus Computational Collaboratory: Enabling Technologies for Relativistic Astrophysics, and a Toolkit for Solving PDEs by Communities in Science and Engineering. In Proceedings of the Seventh IEEE Symposium on the Frontiers of Massively Parallel Computation, pages 36-41, Annapolis, MD, February 1999.
[7] D. Bader, A. Maccabe, J. Mastaler, J. McIver, and P. Kovatch. Design and Analysis of the Alliance/University of New Mexico Roadrunner Linux SMP SuperCluster. In Proceedings of the First IEEE Workshop on Cluster Computing, Melbourne, Australia, December 1999.
[8] G.-S. Karamanos, E. Evangelinos, R. C. Boes, R. M. Kirby, and G. E. Karniadakis. Direct Numerical Simulation of Turbulence with a PC/Linux Cluster: Fact or Fiction? In Workshop at Supercomputing 99, Portland, OR, November 1999.
[9] I. Foster and C. Kesselman, editors. The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, San Francisco, CA, 1999.