Cplant™ System Software

Cplant™ system software (shown in Figure 1) is a collection of code designed, with an emphasis on scalability, to provide a full-featured environment for cluster computing on commodity hardware components. For example, Cplant™ system software provides a scalable message passing layer, scalable runtime utilities, and scalable debugging support. Cplant™ system software is distributed as source code that can be built for a specific hardware configuration. This source code consists of operating system code (in the form of Linux modules and a driver), application support libraries and compiler tools, an MPI port, user-level runtime utilities, support for application debugging, and scripts for configuring and installing the built software.

For its network interconnect, Cplant™ system software uses Myricom's Myrinet technology. Myrinet is a cost-effective gigabit packet-switching technology with programmable network interfaces. Cplant™ system software for Myrinet consists of NIC firmware and related tools, a Linux driver, a packetization module with error detection/correction and debugging support, and various in-house monitoring and diagnostic tools.

In terms of programming languages, Cplant™ system software is coded mainly in C. This code is built on a Linux system using the GNU compiler, while support for parallel application programming is provided for C, C++, and Fortran code using MPI. Application code can be built with the GNU compilers or, in the case of an Alpha-based installation, with the Compaq Alpha-Linux and Tru64 compilers.

Support for compiling and installing code is based largely on the make utility along with Perl and Bash scripts. Installation support is provided in the context of the Cplant™ Virtual Machine (or VM). A VM is a logical partition of hardware components: a single hardware installation can run multiple independent virtual machines. To support this concept, built code is installed in a separate VM structure for each active virtual machine. Code in a VM is then configured to run on a specific hardware subset.
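
To make the VM concept concrete, the following sketch shows one plausible way a per-VM install structure and hardware assignment could look; the /cplant path, the vm-alpha/vm-beta names, and the nodes file are hypothetical illustrations, not the actual Cplant™ conventions.

    # Hypothetical per-VM install layout: one built source tree
    # installed into two independent VM structures.
    for vm in vm-alpha vm-beta; do
        mkdir -p /cplant/$vm/bin /cplant/$vm/lib /cplant/$vm/modules
    done

    # Each VM is then tied to its own hardware subset, sketched
    # here as a per-VM node list (file name is an assumption):
    echo "n0 n1 n2 n3" > /cplant/vm-alpha/nodes
    echo "n4 n5 n6 n7" > /cplant/vm-beta/nodes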

Figure 1 shows a simplified overview of the entire Cplant™ system. The major blocks for preprocessing are SYSTEM PREP and APPLICATION PREPROCESSOR. SYSTEM PREP works in conjunction with the Root File System; similarly, the APPLICATION PREPROCESSOR works with a user disk, which holds user accounts and the compiled executables (exe). The major runtime blocks are MONITOR, SERVICE, COMPUTE, and IO, working in concert with the Storage IO system and the two disk systems mentioned earlier (the Root File System and the user disk). Some representative functions within each block are listed in Figure 1.

The entirety of Cplant™ system software is depicted hierarchically in a tree starting at the top of Figure 2.1. The highest-level software components are: IO, Compute, Config, Doc, Include, Lib, Makefiles, Regression, Release, Scripts, Service, Support, and Tool. The remaining levels of the hierarchy are shown in the rest of Figure 2.1, Figure 2.2, and Figure 2.3. The number of lines of code in each component is indicated, based on the state of the system as of April 2002.

The major Cplant™ system software components in the context of a working Cplant™ system are as follows:

OS Kernel [Compute]*: Cplant™ system software extends the stock Linux OS kernel in two ways: first, by applying a small Cplant™ patch to the distribution source code, and second, by adding a number of dynamically loadable kernel modules: p3mod, cTask, addrCache, myrIP, and rtscts. These modules take on the following respective functions: portals 3 scalable message passing, portals process accounting, a kernel cache of user address space mappings, the IP protocol over Myrinet, and the Myrinet driver. In addition, the Myrinet driver is augmented by an MCP (Myrinet control program) that runs on the Myrinet NIC. This body of software runs on the Cplant™ cluster of processor nodes and forms the main processing and message passing subsystem. This system is divided into service, compute, and IO partitions. The service partition is where users log in and submit jobs. The compute partition is where the computations proceed. The IO partition supports data storage and retrieval.
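
As a rough sketch of how a node might bring these modules up on a 2.x-era kernel (the load order shown is an assumption based on the dependencies described above; the real boot scripts may differ):

    # Load the Cplant™ kernel modules; the Myrinet driver comes
    # first here since the other layers are assumed to depend on it.
    insmod rtscts.o      # Myrinet driver
    insmod myrIP.o       # IP protocol over Myrinet
    insmod addrCache.o   # kernel cache of user address space mappings
    insmod p3mod.o       # portals 3 scalable message passing
    insmod cTask.o       # portals process accounting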

Runtime Utilities [Service]: A set of user-level threads that support running jobs on the Cplant™ system. On the service partition, these consist of yod, bebopd, pingd, and showmesh, which function respectively as job launcher, node allocator, query/administration tool, and display tool. On the compute partition, the pct [Compute] functions as a process control thread. These utilities are augmented on the service and compute partitions by a set of processes that provide debugging support to users. Bt and cgdb live on the service partition, providing front ends to users. Gwrap lives on the compute node and serves as a portals proxy to gdb (the GNU debugger). Support is also provided through this system for parallel debugging in conjunction with TotalView.
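
A typical interactive session on the service partition might look like the following sketch; the -sz flag and its meaning are assumptions about yod's command line, shown only to illustrate how the pieces fit together.

    pingd                   # query the status of the compute nodes
    yod -sz 16 ./my_app     # request 16 nodes from bebopd and launch
                            # my_app on them via the compute-node pcts
    showmesh                # display the resulting node allocation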

IO Support [IO]: In addition to the Cplant™ IO redirection libraries that are linked with user applications, this component includes external IO support, which consists of the fyod portals disk proxy along with the enfs proxies and servers that support parallel IO.

Application Libraries [Compute/lib]: These libraries are supplied to users in the Cplant™ compile environment and are automatically linked in with the user's own application modules at compile time. They consist of the portals 3, MPI (Message Passing Interface), puma, and IO redirection libraries, along with a startup module.
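
The effect is roughly that of a link line of the following shape; the object and archive names are placeholders, since the paragraph above names the libraries but not their actual file names.

    # Placeholder names: the compile scripts link the startup module
    # and the Cplant™ libraries in with the user's own objects.
    gcc -o my_app cplant_start.o my_app.o \
        -lmpi -lpuma -lportals3 -lio_redirect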

Server Library [Lib/comm]: This consists of a library of portals 3 calls used specifically by the runtime utilities. It resides in the Cplant™ system compile environment.

Batch System [Service/pbs4cplant]: A port of PBS (Portable Batch System) to Cplant™. It provides a queueing system that allows users to submit jobs in batch mode.
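
A batch submission might look like the following sketch. The #PBS directives, qsub, and PBS_O_WORKDIR are standard PBS; the node-count resource syntax and the yod invocation are assumptions about this particular port.

    #!/bin/bash
    #PBS -N my_job
    #PBS -l nodes=16            # resource syntax assumed for Cplant™
    #PBS -l walltime=1:00:00

    cd $PBS_O_WORKDIR           # directory from which qsub was run
    yod -sz 16 ./my_app         # launch on the allocated nodes

    # Submitted from the service partition with:  qsub job.sh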

Support Utilities [Support]: A wide variety of support utilities exist to provide mainly administration-related functions: MCP loading, status queries, Myrinet diagnostics, application registration, etc. These live for the most part on the Cplant™ service partition.

Compile Environments [Compute/tool]: System and user compile environments are supplied with Cplant™ system software. The user compile environment consists of Cplant™ application libraries, header files, and compile scripts that can be used in conjunction with existing Tru64, Compaq Linux, or GNU compilers to build applications for Cplant™. The system compile environment, which is embodied in the Cplant™ system source distribution, consists of compile scripts, installation scripts, a makefile system, and the server library code. These are used in conjunction with the GNU C compiler to build the Cplant™ OS components, the Myrinet MCP, and the runtime and support utilities, and to install the resulting system components in the Cplant™ VM (virtual machine) structure.
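
In outline, using the two environments might look like the following; the cplant_cc wrapper name and the VM= make variable are hypothetical stand-ins for the actual compile and installation scripts.

    # User compile environment: a wrapper (name hypothetical) adds
    # the Cplant™ headers and libraries to a normal compile line,
    # backed by a GNU or Compaq compiler.
    cplant_cc -o my_app my_app.c

    # System compile environment: build the system software itself
    # and install the results into a VM structure (names assumed).
    cd cplant-src
    make
    make install VM=vm-alpha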

System Administration, Monitoring, and Diagnostic Tools [various]: A subsystem of administration software for configuring, booting, maintaining, monitoring, and diagnosing the Cplant™ system proper, along with user accounts, user logins, user file systems, the Myrinet interconnect, and the Ethernet backplane.

Regression Testing [Regression]: A set of automated Bash and Perl scripts that, on a nightly basis, builds Cplant™ system software, installs the code on a test cluster, and runs a number of application tests for reliability and benchmarking purposes.
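
In outline, a nightly run of this kind resembles the following sketch; every path, target, and script name here is hypothetical.

    #!/bin/bash
    # Hypothetical nightly regression driver, run from cron.
    set -e
    cd /cplant-src
    make                             # build the system software
    make install VM=test-vm          # install onto the test cluster
    set +e                           # keep going past failing tests
    for t in regression/tests/*.sh; do
        bash "$t" || echo "FAILED: $t" >> nightly.log
    done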

* Higher-level components are denoted in brackets.