NAME

Status_Daemon - CIT monitoring daemon


OVERVIEW

The status daemon is a tool to monitor and trend various portions of a system. It leverages the hierarchy set up in the configuration database to distribute its substantial workload across the non-compute nodes in the system. The status daemon runs on each leader and admin node, and optionally on service nodes.

Each instance of the daemon has two parts. The first is the server portion, which listens for status updates and services requests for status information from clients. The second periodically performs local tests (uptime, load average, etc.) on its own node and remote tests on its children. The remote tests may be non-intrusive (such as ping, status, show_temp, etc.) or may run directly on the node (such as the tests run on service and I/O nodes to watch pingd, qstat, qmgr, ENFS mounts, etc.).
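
To make the division of labor concrete, here is a minimal sketch of what one remote-test cycle amounts to. This is not the actual implementation: the hard-coded child list, the one-minute cycle, and the use of ping as the only test are all assumptions for the example (the real daemon reads its children and tests from the configuration database).

 #!/bin/sh
 # Sketch of the remote-test half of a daemon instance (assumptions
 # noted above). Each cycle runs one non-intrusive test per child and
 # reports the test's status, 0 meaning success.
 CHILDREN="node.n-0.g-1 node.n-1.g-1 node.n-2.g-1"
 while true; do
     for child in $CHILDREN; do
         if ping -c 1 -w 5 "$child" >/dev/null 2>&1; then
             echo "$child ping(0)"
         else
             echo "$child ping(1)"
         fi
     done
     sleep 60     # assumed cycle time
 done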

Tests are fairly easy to write and can be made to query more than just compute nodes. All devices in the system can be monitored, though the focus so far has been on the nodes in the system. The tests listed above (and others not mentioned) were developed based on the tasks that the production staff performs manually every day.

Periodically, status information is passed up the hierarchy. At each level, the status of every device below that leader/admin is stored in memory. Therefore, at the head admin node (as defined in the database; preferably the logging node at Sandia), current status information for the entire system is resident in memory. This allows quick replies to queries for recent status information.

At the highest-level daemon, status information for the entire system is periodically written to disk. Tests that define certain data to be trended are also processed here, with the data stored in round-robin database files. These files maintain rolling averages of the incoming data and are well suited to graphing over time for trend analysis. A graphical view can make it easier to find and understand correlations between various events in the system.
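
The tool behind the round-robin files isn't named here, but files of this kind are commonly handled with rrdtool. As a purely hypothetical example (the file name, data source, and intervals are all invented), a five-minute load-average archive with one-day and one-week rolling views could be created, fed, and graphed like so:

 # rrdtool create load.rrd --step 300 \
       DS:load1:GAUGE:600:0:U \
       RRA:AVERAGE:0.5:1:288 \
       RRA:AVERAGE:0.5:12:168
 # rrdtool update load.rrd N:0.42
 # rrdtool graph load.png DEF:l=load.rrd:load1:AVERAGE LINE1:l#0000ff:load

The first RRA keeps 288 five-minute samples (one day at full resolution); the second keeps 168 one-hour averages (one week), which is the rolling-average behavior described above.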


PROCESS FLOW

See status_daemon_flow for a pretty diagram.


CONFIGURATION

See csdaemon.conf.SAMPLE.xml for configuration options.


EXECUTION

The easiest part of the whole process is simply running the command:

 # status_daemon

Other command-line options are detailed by running:

 # status_daemon --help

Sandia Sidebar

On the west admin node, the daemon will restart itself if it dies, using a user-defined rstate in inittab. The line that does this is in /etc/inittab:47...

 sd:a:ondemand:/cluster/bin/status_daemon

If you want to stop the status daemon on west-sss1-0, edit that line to read:

 sd:a:off:/cluster/bin/status_daemon
      ^^^
and issue the command:
 # telinit a

This should end the process. If for some reason it doesn't, terminate the daemon with the 'kill' command. To restart, change the line back and issue another 'telinit a'. See 'man inittab' or e-mail me for more information on this.
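
The edit-and-reload sequence can also be done non-interactively; assuming GNU sed, something like:

 # sed -i 's|^sd:a:ondemand:|sd:a:off:|' /etc/inittab
 # telinit a

(and the reverse substitution to turn the daemon back on).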

On the leader nodes and node.n-6.g-1 (the lone service node running a daemon), the daemon was started with the command:

 # status_daemon &

Simply looking at the process list and issuing a:

 # kill <pid>

will stop the daemon. To restart, simply issue the 'status_daemon &' command.
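
Where pgrep is available, the two steps collapse into one:

 # kill $(pgrep status_daemon)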


VIEWING STATUS

Status can be viewed with the 'status_client' command. There are many options, but only a few will be covered here. To see all command-line options, run:

 # status_client --help

To see all nodes, with all tests listed (including status of those tests), run:

 # status_client -i

This will display something like:

 * admin-0       df(0) ps(2) uptime(0) 
 * \_ node.n-0.g-0       ping(0) serial(0) status_server(0) uptime(2) 
   |  \_ node.n-0.g-1    ping(0) serial(0) 
   |  \_ node.n-1.g-1    io_mount(0) ping(0) serial(0) 
   |  \_ node.n-2.g-1    ping(0) serial(0) 
   |  \_ node.n-3.g-1    io_mount(0) ping(0) serial(0) 
   |  \_ node.n-4.g-1    ping(0) serial(0) 
   |  \_ node.n-5.g-1    io_mount(0) ping(0) serial(0) 
 * |  \_ node.n-6.g-1    infoerr(0) infomcp(0) infoprotocol(0) ping(0) pingd(0) ps(2) qmgr(0) qstat(0) queues(0) serial(0) uptime(0) 
 .....

The '*' on the left indicates that the node has a problem with one of its tests and/or that a child (at any lower level) has a problem. After the node name, all the tests that have been run on that node are listed, with their status in parentheses. A status of 0 is good. A status of 10 means the test took too long and timed out. Any other non-zero status is bad, with 1 being the most severe.

On a system like west, showing everything can be too much information, and most of the nodes are fine anyway. There is a quiet option to show only the nodes/tests that have problems:

 # status_client -iq
 * admin-0       ps(2) 
 * \_ node.n-0.g-0
 * |  \_ node.n-6.g-1    ps(2) 
 * |  \_ node.n-2.g-3    serial(1) 
 * \_ node.n-1.g-0
 * |  \_ node.n-6.g-6    serial(1) 
 * |  \_ node.n-6.g-7    ping(1) serial(1) 
 * \_ node.n-3.g-0
 * |  \_ node.n-1.g-13   serial(1) 
 * |  \_ node.n-3.g-15   serial(1) 
 * \_ node.n-4.g-0
 * |  \_ node.n-1.g-19   serial(1) 
 * \_ node.n-5.g-0
 * |  \_ node.n-3.g-22   serial(1) 
 * \_ node.n-0.g-24
 * |  \_ node.n-2.g-24   serial(1) 
 * \_ node.n-0.g-25
 * |  \_ node.n-20.g-25  serial(1)

To see the full text of a test (or all tests), use -v (verbose) instead of -i (information). You can also use the familiar -g (group) and -n (node) syntax to cut down the nodes displayed:

 # status_client -v -g 7 -n 6
 * admin-0
 * \_ node.n-1.g-0
 *    \_ node.n-6.g-7
              1 ping                 (Tue Aug  5 07:39:16 2003)
            PING if-0.n-6.g-7 (192.168.58.48) from 192.168.58.1 : 56(84) bytes of data.
            From if-1.n-1.g-0 (192.168.58.1): Destination Host Unreachable
            
            --- if-0.n-6.g-7 ping statistics ---
            4 packets transmitted, 0 packets received, +1 errors, 100% packet loss
            
              1 serial               (Tue Aug  5 07:38:48 2003)
            STATUS: node.n-6.g-7 UNKNOWN discovered
 
 # status_client -i -g 13
 * admin-0
 * \_ node.n-3.g-0
      \_ node.n-0.g-13   ping(0) serial(0) 
 *    \_ node.n-1.g-13   ping(0) serial(1) 
      \_ node.n-2.g-13   ping(0) serial(0) 
      \_ node.n-3.g-13   ping(0) serial(0) 
      \_ node.n-4.g-13   ping(0) serial(0) 
      \_ node.n-5.g-13   ping(0) serial(0) 
      \_ node.n-6.g-13   ping(0) serial(0) 
      \_ node.n-7.g-13   ping(0) serial(0)

Another useful option is --test, which cuts down the tests that are displayed. So, to see all the failed serial tests with their full text, run:

 # status_client -vq --test serial             
 * admin-0
 * \_ node.n-0.g-0
 * |  \_ node.n-6.g-1
 * |  \_ node.n-2.g-3
   |  |       1 serial               (Tue Aug  5 07:36:34 2003)
   |  |     STATUS: node.n-2.g-3 UNKNOWN discovered
   |  |     
 * \_ node.n-1.g-0
 * |  \_ node.n-6.g-6
   |  |       1 serial               (Tue Aug  5 07:38:45 2003)
   |  |     STATUS: node.n-6.g-6 UNKNOWN discovered
   |  |     
 * |  \_ node.n-6.g-7
   |  |       1 serial               (Tue Aug  5 07:38:48 2003)
   |  |     STATUS: node.n-6.g-7 UNKNOWN discovered
   |  |     
 * \_ node.n-3.g-0
 * |  \_ node.n-1.g-13
   |  |       1 serial               (Tue Aug  5 07:38:50 2003)
   |  |     STATUS: node.n-1.g-13        UNKNOWN discovered
   |  |     
 * |  \_ node.n-3.g-15
   |  |       1 serial               (Tue Aug  5 07:38:58 2003)
   |  |     STATUS: node.n-3.g-15        UNKNOWN discovered
   |  |     
 * \_ node.n-4.g-0
 * |  \_ node.n-1.g-19
   |  |       1 serial               (Tue Aug  5 07:40:52 2003)
   |  |     STATUS: node.n-1.g-19        UNKNOWN discovered
   |  |     
 * \_ node.n-5.g-0
 * |  \_ node.n-3.g-22
   |  |       1 serial               (Tue Aug  5 07:38:38 2003)
   |  |     STATUS: node.n-3.g-22        UNKNOWN discovered
   |  |     
 * \_ node.n-0.g-24
 * |  \_ node.n-2.g-24
   |  |       1 serial               (Tue Aug  5 07:35:42 2003)
   |  |     STATUS: node.n-2.g-24        UNKNOWN discovered
   |  |     
 * \_ node.n-0.g-25
 * |  \_ node.n-20.g-25
   |  |       1 serial               (Tue Aug  5 07:36:25 2003)
   |  |     STATUS: node.n-20.g-25       UNKNOWN discovered
   |  |


CURRENT TESTS

General Tests

df - Check disk partitions
Warn (status=2) if a partition is over 80% full; panic (status=1) if it is over 95% full. (A sketch of such a test appears after this list.)

ps - Check process table
Check for certain processes that should/should not be running. Not quite tweaked correctly, but it's close.

uptime - Check load average
Warn/panic if the load average gets too high.

ping - Check network connectivity
serial - Check serial port connectivity
status_server - Check child status daemons
Status of the daemon on that node, as seen from the parent daemon:
 0 - the server on the leader is working properly
 1 - the leader closed the connection
 2 - the leader has not sent a status update in too long a time
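
As promised under the df entry, here is a hypothetical sketch of a threshold test, using df's thresholds and the exit-status convention from VIEWING STATUS (0 good, 2 warn, 1 panic). It illustrates how small a test can be; it is not the shipped df test:

 #!/bin/sh
 # Exit 0 if every partition is at or under 80% full, 2 (warn) if any
 # is over 80%, and 1 (panic) if any is over 95%.
 df -P | awk 'NR > 1 {
     use = $5 + 0                # "83%" -> 83
     if (use > 95) panic = 1; else if (use > 80) warn = 1
 }
 END { exit(panic ? 1 : (warn ? 2 : 0)) }'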

Cplant Specific Tests

io_mount - Check that enfs mounts are working
infoerr - Check message errors in system
infomcp - Check MCP stats in system
infoprotocol - Check protocol stats in system
pingd - Check free/busy nodes as seen by pingd
qmgr - Check free/busy nodes, number of jobs in qmgr
qstat - Check allocated nodes, number of jobs in qstat
queues - Ensure that qstat, qmgr, and pingd all see the same system state (sketched below)
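
To give a flavor of the queues test, here is a hypothetical consistency check. The three arguments stand in for free-node counts already parsed out of pingd, qstat, and qmgr output; the parsing itself is omitted, since those output formats aren't documented here:

 #!/bin/sh
 # Usage: queues_check <pingd-count> <qstat-count> <qmgr-count>
 # Exit 0 if all three agree, 1 (and complain) if they do not.
 [ "$1" = "$2" ] && [ "$2" = "$3" ] && exit 0
 echo "queues: pingd=$1 qstat=$2 qmgr=$3 disagree"
 exit 1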


FUTURE WORK

Additional work that could benefit users of the status daemon includes integration with the command daemon, additional tests, and GUI development. Command daemon integration would allow commands to be run more efficiently across the system by leveraging the current status of nodes to avoid misbehaving ones, and could even save effort when recent (within a minute or so) data is good enough. Additional tests, especially of Myrinet switches and file systems, could provide even more useful data to analyze in conjunction with the other system tests. GUI work could include any number of interfaces, including an HTML page describing the current state of the system and/or the ability to get status from within the current cgui program.