#############################################################################
#
#     This Cplant(TM) source code is the property of Sandia National
#     Laboratories.
#
#     This Cplant(TM) source code is copyrighted by Sandia National
#     Laboratories.
#
#     The redistribution of this Cplant(TM) source code is subject to the
#     terms of the GNU Lesser General Public License
#     (see cit/LGPL or http://www.gnu.org/licenses/lgpl.html)
#
#     Cplant(TM) Copyright 1998, 1999, 2000, 2001, 2002, 2003, 2004
#     Sandia Corporation.
#     Under the terms of Contract DE-AC04-94AL85000, there is a non-exclusive
#     license for use of this work by or on behalf of the US Government.
#     Export of this program may require a license from the United States
#     Government.
#
#############################################################################

Welcome to the CIT!
--------------------

What follows is information on the Cluster Integration Toolkit (CIT).
This document outlines what the CIT is, where it stands now, the vision
for its future, and some tasks to be done in the near future (Release III).

WHAT IS THE CIT?
--------------------

The Cluster Integration Toolkit is a set of software tools for
configuring, testing, and managing a (Linux) cluster. The CIT is designed
to be architecture and hardware independent and to provide a stable
platform for all runtime environments and user applications.

The CIToolkit is an attempt to meet the following goals:

- Provide efficient and standardized cluster integration processes.
- Automate as many install and implementation procedures as possible.
- Provide tools that leverage a single database for configuration,
  testing, systems management, performance management, etc.
- Reduce hardware-related labor due to the large number of commodity
  parts (generally with lower quality assurance) used in an environment
  where high reliability is demanded.
- Increase reliability and repeatability of the process of bringing
  clusters online.
- Scale easily to thousands of nodes, both in performance and
  manageability.

CIT MODULES
--------------------

Currently, the CIT contains the following modules:

base         -- the top-level module (required for all other modules)
chits        -- the cluster hardware issue tracking system
config       -- support for various system architectures as well as
                example configuration files from existing clusters
cplant       --
cplant-light --
devtools     -- scripts and documentation for anyone doing development
                work for CIT (not meant to be installed)
diag         -- hardware test, benchmark, and diagnostic tools for
                verifying that a cluster is operating consistently and
                optimally
diskfull     -- enables diskfull booting
diskless     -- enables diskless booting using root-over-NFS
extras       -- provides advanced features, such as the status_daemon
                monitoring tool
ganglia      -- provides necessary support for the ganglia monitoring
                framework
ilo          -- support for the ILO server management protocol
ipmi         -- support for the IPMI server management protocol
light-os     --
myrinet      -- automates the process of installing GM, MPICH, and other
                custom tools on a cluster that uses the Myrinet
                interconnect
distros      -- software configuration tools, libraries, and distro
                class hierarchy
torque-maui  -- integrated support for the torque batch scheduling
                system utilizing the maui scheduler

WHAT DO WE HAVE NOW?
--------------------

Currently, CIT consists of two major components:

1. Configuration database and "infrastructure tools" developed at Sandia
   National Labs for the Cplant cluster environment. This includes tools
   for configuring a diskless cluster environment as well as discovering,
   booting, and monitoring nodes and supporting hardware.

2. Test and diagnostic tools and procedures developed by HPTi for the
   Cplant cluster (and for Forecast Systems Lab). This is an "add on" to
   the infrastructure tools.
   It includes third-party test and diagnostic programs, tools which
   leverage the configuration database, specialized Myrinet diagnostic
   tools, and programs for generating custom Myrinet routes for the
   Sandia topology.

Most of the existing CIT software is written in Perl (the exception being
the Myrinet tools, which are largely written in C), making it easily
portable to many Unix-like operating systems.

A limited amount of hardware is directly supported at this time, but the
framework provides for relatively easy addition of new "device drivers".
The device drivers for the Sandia Cplant cluster hardware have been
thoroughly tested, and support for additional devices in the HPTi
testbed/HRCT cluster has recently been written.

THE VISION
--------------------

The overarching vision for the CIT is to turn as many integration
procedures as possible into automated tools. This includes all parts of
the process, from cluster design, hardware purchasing, and equipment
labeling to systems management, monitoring, performance tuning, and
capacity planning.

All tools in the toolkit should leverage the configuration database to
reduce overall configuration time. (A reasonable model for this would be
the Black & Decker cordless tools: just charge one battery and it works
in any tool!) The initial population of the database may be somewhat
complicated and require extra experience, as it involves detailed
knowledge of the cluster architecture. However, once the database is
configured, any user with reasonable Unix system administration
experience should be able to use all the tools in the toolkit.

The Cluster Integration Toolkit should easily deal with multiple hardware
and software architectures. Differences between Alpha and i386, diskfull
and diskless, hierarchical (with "leaders") and flat designs, Myrinet and
Gig-E, GM and Cplant, MPICH and PVM, should all be handled with minimal
impact to the user.
A "snap-in" architecture should be used to allow users to simply add the
features they wish to use. To support this, the CIT needs well-defined
APIs at both the low-level database layer and at a higher "runtime
application" level. Adding new modules to follow advances in hardware
should be easy and have little or no impact on other tools in the
toolkit. The CIT should be able to mitigate heterogeneity, transparently
handling multiple snap-ins simultaneously.

The CIT should also function as a repository for known software updates.
Seldom is there an OS kernel version or application that does not need
some fix, hack, or unusual configuration, especially when one is trying
to make it work efficiently in a cluster environment. When possible,
these fixes should be made automatically upon configuration of the CIT or
installation of the appropriate snap-in module. Otherwise, fixes should
be clearly documented.

The CIToolkit should be leveraged and extended for further tools
development in the areas of systems management, cluster usage data
capture and presentation, and performance management. Other useful
extensions of the CIT include hardware issue tracking for the purposes of
problem diagnosis and recovery, inventory control, and capacity planning.

Ideally, the configuration database should be folded back into the
hardware integration and installation phase. Filling the configuration
database should be part of the hardware specification process. With the
database in place before the hardware, and the addition of a couple of
new tools to the CIT, it is possible to have vendor-printed labels
installed on all hardware to match the configuration database well ahead
of time, thus speeding hardware installation and later troubleshooting.
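
As an illustration of the "snap-in" idea described under THE VISION, the
following is a minimal sketch of how tools might share one configuration
database while dispatching to pluggable device drivers. It is not the
CIT's actual API: the CIT itself is written in Perl, and Python is used
here only for brevity. Every name in it (DRIVERS, register_driver,
CONFIG_DB, power_on, the node records, and the addresses) is hypothetical.
Only the "ipmi" and "ilo" labels are borrowed from the module list above.

```python
"""Hypothetical sketch: one shared config database, pluggable drivers."""

# Registry mapping a device type name to its driver class (the "snap-ins").
DRIVERS = {}


def register_driver(device_type):
    """Class decorator that adds a driver to the registry under device_type."""
    def decorator(cls):
        DRIVERS[device_type] = cls
        return cls
    return decorator


@register_driver("ipmi")
class IpmiPower:
    """Stand-in driver; a real one would talk to the hardware."""
    def power_on(self, device):
        return f"ipmi: power on {device['name']} via {device['address']}"


@register_driver("ilo")
class IloPower:
    """Second stand-in driver, added without touching the first."""
    def power_on(self, device):
        return f"ilo: power on {device['name']} via {device['address']}"


# Stand-in for the single configuration database shared by every tool.
# The records and addresses are invented for this example.
CONFIG_DB = {
    "n0": {"name": "n0", "power": "ipmi", "address": "10.0.0.100"},
    "n1": {"name": "n1", "power": "ilo", "address": "10.0.1.100"},
}


def power_on(node_name, db=CONFIG_DB):
    """Look the node up in the database and dispatch to its driver."""
    device = db[node_name]
    driver_cls = DRIVERS.get(device["power"])
    if driver_cls is None:
        raise LookupError(f"no driver registered for {device['power']!r}")
    return driver_cls().power_on(device)


if __name__ == "__main__":
    for name in sorted(CONFIG_DB):
        print(power_on(name))
```

In this sketch, supporting a new kind of hardware means registering one
more driver class; the database records and the power_on tool stay
unchanged, which is the "little or no impact on other tools" property the
vision asks for.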