#############################################################################
#
# This Cplant(TM) source code is the property of Sandia National
# Laboratories.
#
# This Cplant(TM) source code is copyrighted by Sandia National
# Laboratories.
#
# The redistribution of this Cplant(TM) source code is subject to the
# terms of the GNU Lesser General Public License
# (see cit/LGPL or http://www.gnu.org/licenses/lgpl.html)
#
# Cplant(TM) Copyright 1998, 1999, 2000, 2001, 2002, 2003, 2004
# Sandia Corporation.
# Under the terms of Contract DE-AC04-94AL85000, there is a non-exclusive
# license for use of this work by or on behalf of the US Government.
# Export of this program may require a license from the United States
# Government.
#
#############################################################################

Another Cluster Management Example (ACME)
=========================================

The purpose of the ACME set of examples is to get you started using the
Cluster Integration Toolkit (CIT) in a virtual cluster environment.  In this
way you can try some of the basic commands and infrastructure of the toolkit
and familiarize yourself with it.  The ACME device modules

  base/lib/Device/Node/ACME.pm
  base/lib/Device/Power/ACME.pm
  base/lib/Device/TermSrvr/ACME.pm

also provide highly commented code to get you started developing your own
device modules to support more hardware with CIT.  If you would like more
information on the philosophy and design of CIT, please see the CIT
publications page at http://www.cs.sandia.gov/cit/publications/index.html.

This document assumes that you have already installed the base module.  If
you have not done so, please consult the INSTALL instructions within the
base module before completing this tutorial.  After completing the base
installation, CIT is ready for some real data.

The heart of CIT's operation on a cluster is the cluster database.  This
database stores ALL the information describing the hardware of the cluster.
This includes, but certainly isn't limited to, nodes, the ethernet network
(switches, connections, ...), terminal servers facilitating console access
to nodes, remote power control of any device, and the high speed network
(Myrinet, ...).  The database can truly contain any information that is
required, and it is used throughout the CIT library by devices, utilities,
etc.

The database in CIT is abstracted away from the database implementation
using the MyDB layer (see base/lib/MyDB*).  The default and most common
database implementation that CIT uses is GDBM (a short standalone example of
what a GDBM file is appears a few paragraphs below).  Other options include
a DBI interface and an LDAP interface.  The DBI interface with the DBD::CSV
driver allows CIT to directly use a comma separated text file as the
database.

There are several methods for populating the database.  The most common is
to write a configuration script (in Perl, like most of CIT) which describes
the cluster hardware.  Other methods include using a Perl script to import a
CSV text file (see config/machines/freedom for an example of a CSV file and
an import script) or writing a CSV file directly (this could be a bit
tricky).  The most flexible and powerful way is to write the configuration
script directly, and there are even several ways to do this!  In any event,
there are plenty of configuration script examples (in addition to what we'll
do here with ACME) in the machines directory within the config module.

So let's create a database for our purposes.  We will describe a small,
simple cluster.
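As promised above, here is a tiny standalone Perl illustration of what a
GDBM file is: a persistent key/value (hash) file on disk.  This is NOT CIT
code and does not show CIT's actual record format (the MyDB layer takes care
of that); it only shows the kind of file GDBM provides.

  #!/usr/bin/perl
  # Standalone GDBM illustration -- not CIT code, and not CIT's on-disk
  # record format.  GDBM is simply a hash tied to a file on disk.
  use strict;
  use warnings;
  use GDBM_File;

  my %db;
  tie %db, 'GDBM_File', '/tmp/example.db', &GDBM_WRCREAT, 0640
      or die "cannot open /tmp/example.db: $!";

  $db{'n1'} = 'serialized object data would go here';
  print "n1 => $db{'n1'}\n";

  untie %db;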
Pretty much any way you want to build a cluster, you will want to have a
top-level node (or nodes) of some sort.  We refer to these nodes as admin
nodes in CIT.  The ACME virtual cluster will have one admin node (admin0)
with an external eth interface (eth1) and an internal eth interface (eth0)
that connects to a private network inside the cluster (called a diagnostic
or management network).  All nodes in the cluster are connected to this
network via their eth0 interface.  The cluster is composed of two racks
(rack1, rack2) of eight nodes each (named n1-n8 and n9-n16).  Each rack has
a terminal server that allows access to the serial console of each node.
Each rack also has an RPC (remote power controller) that the nodes are
connected to.

The configure-ACME-library-style.pl script in the doc directory will create
the database for this ACME cluster using internal CIT library functions
directly.  This is the fastest and most powerful way to do database
creation.

# ./configure-ACME-library-style.pl

This will create $(HOME)/cluster/config/cluster.db, a GDBM database file.

The database is populated with objects in a hierarchical fashion and thus
requires a root object.  This is the top-level collection in CIT.  A
collection is simply an object containing other objects (including other
collections).  Usually this top-level collection is named 'equipment'.

The simplest command to interact with the database is the 'lookup' command.

# lookup equipment

will give you a list of the objects we populated the database with.  Note
the single bless(...) with Collection at the end (telling us it is an object
of type Collection!  See cluster/lib/Collection.pm for the code that
implements this object).  The " 'name' => 'equipment' " entry is the name
(you'd never guess) of this particular Collection object, and
" 'bag' => ... " contains the rest of the data structure, which is just a
list of names of other objects.

A 'lookup' can be used on any object in the database.  Look at the entry for
the node named 'n1':

# lookup n1

Look at the entry for the RPC in rack 1:

# lookup power1

You can see that each object is composed of multiple sub-objects, as
indicated by the multiple bless(....) containers.  Notice that each bless
has its object type at the end: NetAddress::IP::TCP, NIC::Ethernet,
Device::Node::ACME, etc.  The code for these objects is in
cluster/lib/NetAddress/IP/TCP.pm, cluster/lib/NIC/Ethernet.pm, and
cluster/lib/Device/Node/ACME.pm, respectively.  Each object has attributes
appropriate for that device, e.g. name, net_mask, and address are the basic
attributes for a NIC::Ethernet object.  (A tiny standalone Perl illustration
of this style of nested bless() output appears a little further below.)

Two of CIT's fundamental commands are 'status' and 'power'.  'status'
returns information about the state of an object in the cluster with respect
to power, connectivity, OS booted, database status, etc., and 'power'
controls the power to an object.

# status n1

This tells you the status of the n1 node is 'READY'.  The READY state
indicates the node is ready to boot.  You can also supply the --full flag to
the status command (along with many other flags -- use "status --help") to
display all types of status that the device n1 provides.  The "DEVICE"
status displayed by default is required, but devices may provide any other
additional types of status if they wish.

# status --full n1

This tells you the "DEVICE" status is 'READY' but in addition displays the
"DISCOVERED" status, in this case 'UNKNOWN', indicating the node hasn't yet
been discovered.
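As mentioned above, here is a minimal standalone Perl sketch of the kind of
nested bless() output that 'lookup' dumps.  This is not CIT code; the class
names are the ones you just saw, but the attribute layout is simplified and
partly made up for illustration.

  #!/usr/bin/perl
  # Minimal standalone sketch of nested bless() objects dumped with
  # Data::Dumper -- not CIT code; the attributes here are simplified.
  use strict;
  use warnings;
  use Data::Dumper;

  # An Ethernet NIC object blessed into the NIC::Ethernet class name.
  my $nic = bless {
      name     => 'eth0',
      address  => '10.0.0.1',
      net_mask => '255.255.255.0',
  }, 'NIC::Ethernet';

  # A node object containing the NIC, blessed into Device::Node::ACME.
  # (Whether a real CIT node object holds its NICs in a 'nics' list is an
  # assumption made for this sketch.)
  my $node = bless {
      name => 'n1',
      nics => [ $nic ],
  }, 'Device::Node::ACME';

  # Prints a bless( {...}, 'NIC::Ethernet' ) container nested inside a
  # bless( {...}, 'Device::Node::ACME' ) container, much like 'lookup n1'.
  print Dumper($node);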
Discovering a node or other device usually means finding the MAC address(es)
associated with the Ethernet interfaces that get their information via
DHCP.  Try doing this:

# device_mgr --set --interface 0 --mac_address 001122334455 n1
# status --full n1

Notice that the "DISCOVERED" status is still 'UNKNOWN'; this is because not
all of the node's interfaces are populated with a MAC address yet.  Now try
this:

# device_mgr --set --interface 1 --mac_address 001122334455 n1
# status --full n1

You'll notice it says it's discovered now!  This also introduces device_mgr,
the device manager, which is the CIT command line tool for interacting with
the database once it is created.

Try statusing collections of objects.

# status equipment
# status rack1
# status rack2

The rack1 and rack2 collections were created by the database creation
script.  Collections can be managed (new ones created, objects added, etc.)
using the collection_mgr tool.

Status isn't just limited to nodes.  Try the RPCs for instance.

# status power1 power2

Note you can specify multiple objects on the command line.

Also try turning on debugging output for the CIT library code that is called
on to perform the status command for an ACME node.

# status --libdebug Device::Node::ACME n1

Note that you get a bunch more output.  This output might help you in
designing your own device classes by following the order of execution
through the ACME classes.  The string "Device::Node::ACME" is the class name
of the n1 object.

Now try the power command to get an idea of what happens.  Since this isn't
real hardware, the power state doesn't change, but you'll get the idea.

# power --off n1
# power --on n1
# power --cycle n1

'console' is the CIT command used to get access to the console of a device.
This command doesn't work for ACME since there really aren't any console
devices -- or networks, for that matter!

'discover' is the CIT command used to probe and find data about the hardware
in a cluster, which is usually the MAC addresses associated with DHCP.  Try
it!

# lookup n2
# status --full n2
# discover n2
# lookup n2
# status --full n2

Notice that the discover process updated both interfaces for the node n2
automatically.  The status command now indicates "discovered", which tells
you that both interfaces are assigned a MAC address when previously they
were not.


Using the DBD::CSV Database Format
----------------------------------

You can use ACME to try another database implementation.  'db_mgr' is the
CIT database manager command line utility.  It is useful for manipulating
the database as a whole, such as dumping out its contents or, what we're
interested in here, copying it to another supported format.  We'll convert
the ACME database from GDBM format to DBD::CSV format (Comma Separated
Value).

First you'll need to make sure that the Perl on your system has the
Bundle::DBI and DBD::CSV CPAN modules installed (or the perl-DBI RPM and
DBD::CSV).  If these aren't currently on your system, you can download these
and other Perl modules from www.cpan.org.  (Note that you'll need gcc or
another C compiler to install these modules.)

Then make a directory for the CSV file.

# cd $(HOME)/cluster/config
# mkdir csv

Now, do the conversion

# db_mgr --copy "DBI:CSV:f_dir=$HOME/cluster/config/csv;csv_eol=\012:acme"

If all is well with your Perl, notice that the text file
$(HOME)/cluster/config/csv/acme appears, which is the text CSV version of
the database.  The funky string in double quotes after --copy on the db_mgr
command line is a MyDB database specification (remember MyDB from above?).
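If the idea of a plain text file acting as a database seems odd, here is a
small standalone Perl sketch of DBD::CSV in action.  It does not touch the
CIT database at all; the directory, table, and column names are made up for
illustration.

  #!/usr/bin/perl
  # Standalone DBD::CSV sketch -- not CIT code.  It shows how the DBI layer
  # can treat plain CSV files in a directory as SQL tables.
  use strict;
  use warnings;
  use DBI;
  use File::Path qw(make_path);

  my $dir = '/tmp/csv-demo';          # made-up scratch directory
  make_path($dir);

  my $dbh = DBI->connect("DBI:CSV:f_dir=$dir", undef, undef,
                         { RaiseError => 1 });

  # Each table is simply a CSV file in $dir (here: /tmp/csv-demo/hosts).
  $dbh->do('CREATE TABLE hosts (name CHAR(16), mac CHAR(12))');
  $dbh->do(q{INSERT INTO hosts VALUES ('n1', '001122334455')});

  my $rows = $dbh->selectall_arrayref('SELECT name, mac FROM hosts');
  print "$_->[0] $_->[1]\n" for @$rows;

  $dbh->disconnect;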
See 'db_mgr --help' or 'perldoc $(HOME)/cluster/lib/MyDB.pm' for details on
the db specification syntax.

Take a look at the CIT cluster config module in
$(HOME)/cluster/config/CConf.pm.  The database layer that MyDB uses is
specified by the 'db' entry in the config (search for 'db =>').  Notice
that, as mentioned earlier, it is set up by default to use a GDBM file named
.../cluster/config/cluster.db (GDBM adds the .db suffix).  Now comment out
the GDBM line and uncomment the line directly below it, which has a string
that looks the same as what we just used (a schematic of this edit appears
at the end of this document).  You just changed which database CIT is using.
Now try a database command:

# lookup n1

Still works the same as before.

Now edit the $(HOME)/cluster/config/csv/acme text file and change some
attribute of n1 (for instance, change the netmask of 255.255.255.0 to
255.255.128.0).  Query the database about n1 again and notice the netmask
changed.

# lookup n1

Now use the diff feature of db_mgr to note the difference between your
current ACME database in DBI:CSV format (because you edited it) and the
original GDBM database.

# db_mgr --diff "DBI:CSV:f_dir=$HOME/cluster/config/csv;csv_eol=\012:acme" "GDBM:$HOME/cluster/config/cluster"

The output of this is in a familiar diff-like format, and you'll see that
the netmask is different in the CSV database versus the GDBM database.

NOTE: Under the CIT distribution directory, see misc/ACME_csv for a sorted
version of the ACME database CSV file.  You can copy this file to
$(HOME)/cluster/config/csv/acme (note the file extension gets dropped, blame
DBD::CSV) and use it as your database directly as well, with the proper 'db'
line in $(HOME)/cluster/config/CConf.pm.
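For reference, the CConf.pm change described above looks schematically like
the sketch below.  This is only a sketch, NOT the real CConf.pm -- the
actual file may construct the strings differently (search it for 'db =>')
-- but the two database specifications are the same ones used on the db_mgr
command line above.

  #!/usr/bin/perl
  # Schematic sketch only -- NOT the real CConf.pm.  It just shows the two
  # MyDB database specifications side by side; in the real config you
  # comment one 'db' entry out and leave the other one in.
  use strict;
  use warnings;

  my $home = $ENV{HOME};

  my %config = (
      # Default backend: a GDBM file (GDBM adds the .db suffix):
      # db => "GDBM:$home/cluster/config/cluster",

      # CSV backend, as used with db_mgr --copy above.  The \012 octal
      # escape is the newline used as the CSV end-of-line character.
      db => "DBI:CSV:f_dir=$home/cluster/config/csv;csv_eol=\012:acme",
  );

  # Prints "DBI", the database layer named at the front of the spec.
  print "database layer: ", (split /:/, $config{db})[0], "\n";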