Status_Daemon - CIT monitoring daemon
The status daemon is a tool to monitor and trend various portions of a system. It leverages the hierarchy set up in the configuration database to distribute the substantial amount of work it must accomplish across non-compute nodes in the system. The status daemon runs on each leader and admin node in the system, and optionally on service nodes.
Each instance of the daemon has two parts. The first part is the server portion, which listens for status updates and services requests for status information from clients. The other part periodically performs local tests (uptime, load average, etc.) on itself and remote tests on its children. The remote tests may be non-intrusive (such as ping, status, show_temp, etc.) or may run on the node (such as tests to run on service and I/O nodes to watch pingd, qstat, qmgr, ENFS mounts, etc.).
Tests are fairly easy to write and can be made to query more than just compute nodes. All devices in the system can be monitored, though the focus so far has been on the nodes in the system. The tests listed above (and others not mentioned) were developed based on the tasks that the production staff performs manually every day.
Periodically, status information is passed up the hierarchy. At each level, the status of every device below that leader/admin is stored in memory. Therefore, at the head admin node (as defined in the database---preferably the logging node at Sandia), current status information for the entire system is resident in memory. This allows quick replies to queries for recent status information.
At the highest level daemon, status information for the entire system is periodically written to disk. Tests that define certain data to be trended are also processed here, with the data stored in round-robin database files. These files compute rolling averages of the incoming data and are well suited to graphing over time for trend analysis. Having a graphical output can make it easier to find and understand correlations between various events in the system.
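The daemon's actual storage code is not shown in this document, but the rolling-average idea behind round-robin database files can be sketched in a few lines of awk (illustrative only; the four-sample consolidation window is an arbitrary choice):

```shell
# Illustrative sketch only -- not the daemon's storage code.
# A round-robin database consolidates raw samples into averages over
# fixed windows; here each group of 4 incoming values becomes one
# averaged data point.
printf '1\n2\n3\n4\n5\n6\n7\n8\n' |
awk '{ sum += $1; n++ }
     n == 4 { print sum / n; sum = 0; n = 0 }'
```

A real round-robin file additionally caps the number of stored points, overwriting the oldest, which keeps disk usage constant no matter how long the daemon runs.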
See status_daemon_flow for a pretty diagram.
See csdaemon.conf.SAMPLE.xml for configuration options.
Starting the daemon is the easiest part of the whole process---simply run the command:
# status_daemon
Other command-line options are detailed by running:
# status_daemon --help
On the west admin node, the daemon will restart itself if it dies, using a user-defined rstate in inittab. The line that does this is in /etc/inittab:47...
sd:a:ondemand:/cluster/bin/status_daemon
If you want to stop the status daemon on west-sss1-0, edit that line to read:
sd:a:off:/cluster/bin/status_daemon
     ^^^
and issue the command:
# telinit a
This should end the process. If for some reason it doesn't, terminate the daemon with the 'kill' command. To restart, change the line back and issue another 'telinit a'. See 'man inittab' or e-mail me for more information on this.
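The edit above can also be scripted. This sed sketch demonstrates the change on a copy of the inittab line (editing the live /etc/inittab by hand is what the text describes; the sd:a: prefix matches the entry shown above):

```shell
# Demonstrate the edit on a copy of the inittab entry rather than the
# live file. The rstate field is flipped from 'ondemand' to 'off'.
line='sd:a:ondemand:/cluster/bin/status_daemon'
echo "$line" | sed 's/^sd:a:ondemand:/sd:a:off:/'
# After making the same change in /etc/inittab, run 'telinit a' so
# init re-reads the file and stops respawning the daemon.
```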
On the leader nodes and node.n-6.g-1 (the lone service node running a daemon), the daemon was started with the command:
# status_daemon &
Simply looking at the process list and issuing a:
# kill <pid>
will stop the daemon. To restart, simply issue the 'status_daemon &' command.
Status can be viewed with the 'status_client' command. There are many options, but only a few will be covered here. To see all command-line options, run:
# status_client --help
To see all nodes, with all tests listed (including status of those tests), run:
# status_client -i
This will display something like:
* admin-0 df(0) ps(2) uptime(0)
* \_ node.n-0.g-0 ping(0) serial(0) status_server(0) uptime(2)
  | \_ node.n-0.g-1 ping(0) serial(0)
  | \_ node.n-1.g-1 io_mount(0) ping(0) serial(0)
  | \_ node.n-2.g-1 ping(0) serial(0)
  | \_ node.n-3.g-1 io_mount(0) ping(0) serial(0)
  | \_ node.n-4.g-1 ping(0) serial(0)
  | \_ node.n-5.g-1 io_mount(0) ping(0) serial(0)
* | \_ node.n-6.g-1 infoerr(0) infomcp(0) infoprotocol(0) ping(0) pingd(0) ps(2) qmgr(0) qstat(0) queues(0) serial(0) uptime(0)
.....
The '*' on the left indicates that the node has a problem with one of its tests and/or that a child (at any lower level) has a problem. After the node name, all the tests that have been run on that node are listed, with their status in parentheses. A status of 0 is good. A status of 10 means the test took too long and timed out. Any other non-zero status is bad, with 1 being the most severe.
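Given that convention, a test is simply a program whose exit status encodes health and whose printed output becomes the full text shown by the verbose client. Here is a minimal sketch that assumes nothing about the real tests beyond the exit-code convention above (the io_mount-style name and the mountpoint argument are illustrative, not the daemon's actual code):

```shell
#!/bin/sh
# Hypothetical mount-check test -- illustrative only.
# Exit-status convention from the text: 0 good, 10 timed out,
# other non-zero bad (1 most severe).
check_mount() {
    if grep -q " $1 " /proc/mounts; then
        echo "OK: $1 is mounted"
        return 0
    else
        echo "FAIL: $1 is not mounted"
        return 1
    fi
}

# Check the mountpoint given as the first argument, defaulting to /.
check_mount "${1:-/}"
```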
On a system like west, showing everything can be too much information, and most of the nodes are doing fine, anyway. There is a quiet option to only show nodes/tests that have problems:
# status_client -iq
* admin-0 ps(2)
* \_ node.n-0.g-0
* | \_ node.n-6.g-1 ps(2)
* | \_ node.n-2.g-3 serial(1)
* \_ node.n-1.g-0
* | \_ node.n-6.g-6 serial(1)
* | \_ node.n-6.g-7 ping(1) serial(1)
* \_ node.n-3.g-0
* | \_ node.n-1.g-13 serial(1)
* | \_ node.n-3.g-15 serial(1)
* \_ node.n-4.g-0
* | \_ node.n-1.g-19 serial(1)
* \_ node.n-5.g-0
* | \_ node.n-3.g-22 serial(1)
* \_ node.n-0.g-24
* | \_ node.n-2.g-24 serial(1)
* \_ node.n-0.g-25
* | \_ node.n-20.g-25 serial(1)
To see the full text of a test (or all tests), use -v (verbose) instead of -i (information). You can also use the familiar -g (group) and -n (node) syntax to cut down the nodes to display:
# status_client -v -g 7 -n 6
* admin-0
* \_ node.n-1.g-0
* \_ node.n-6.g-7
1 ping (Tue Aug 5 07:39:16 2003)
PING if-0.n-6.g-7 (192.168.58.48) from 192.168.58.1 : 56(84)
bytes of data.
From if-1.n-1.g-0 (192.168.58.1): Destination Host
Unreachable
--- if-0.n-6.g-7 ping statistics ---
4 packets transmitted, 0 packets received, +1 errors, 100%
packet loss
1 serial (Tue Aug 5 07:38:48 2003)
STATUS: node.n-6.g-7 UNKNOWN discovered
# status_client -i -g 13
* admin-0
* \_ node.n-3.g-0
\_ node.n-0.g-13 ping(0) serial(0)
* \_ node.n-1.g-13 ping(0) serial(1)
\_ node.n-2.g-13 ping(0) serial(0)
\_ node.n-3.g-13 ping(0) serial(0)
\_ node.n-4.g-13 ping(0) serial(0)
\_ node.n-5.g-13 ping(0) serial(0)
\_ node.n-6.g-13 ping(0) serial(0)
\_ node.n-7.g-13 ping(0) serial(0)
Another useful option is --test. It allows you to cut down the tests that are being displayed. So, to see all the failed serial tests, with their full text, we do:
# status_client -vq --test serial
* admin-0
* \_ node.n-0.g-0
* | \_ node.n-6.g-1
* | \_ node.n-2.g-3
  | | 1 serial (Tue Aug 5 07:36:34 2003)
  | | STATUS: node.n-2.g-3 UNKNOWN discovered
  | |
* \_ node.n-1.g-0
* | \_ node.n-6.g-6
  | | 1 serial (Tue Aug 5 07:38:45 2003)
  | | STATUS: node.n-6.g-6 UNKNOWN discovered
  | |
* | \_ node.n-6.g-7
  | | 1 serial (Tue Aug 5 07:38:48 2003)
  | | STATUS: node.n-6.g-7 UNKNOWN discovered
  | |
* \_ node.n-3.g-0
* | \_ node.n-1.g-13
  | | 1 serial (Tue Aug 5 07:38:50 2003)
  | | STATUS: node.n-1.g-13 UNKNOWN discovered
  | |
* | \_ node.n-3.g-15
  | | 1 serial (Tue Aug 5 07:38:58 2003)
  | | STATUS: node.n-3.g-15 UNKNOWN discovered
  | |
* \_ node.n-4.g-0
* | \_ node.n-1.g-19
  | | 1 serial (Tue Aug 5 07:40:52 2003)
  | | STATUS: node.n-1.g-19 UNKNOWN discovered
  | |
* \_ node.n-5.g-0
* | \_ node.n-3.g-22
  | | 1 serial (Tue Aug 5 07:38:38 2003)
  | | STATUS: node.n-3.g-22 UNKNOWN discovered
  | |
* \_ node.n-0.g-24
* | \_ node.n-2.g-24
  | | 1 serial (Tue Aug 5 07:35:42 2003)
  | | STATUS: node.n-2.g-24 UNKNOWN discovered
  | |
* \_ node.n-0.g-25
* | \_ node.n-20.g-25
  | | 1 serial (Tue Aug 5 07:36:25 2003)
  | | STATUS: node.n-20.g-25 UNKNOWN discovered
  | |
Additional work that could benefit users of the status daemon includes integration with the command daemon, additional tests, and GUI development. Command daemon integration would allow commands to be run more efficiently across the system by leveraging the current status of nodes to avoid misbehaving ones, and could even save effort when recent (within a minute or so) data is good enough. Additional tests, especially of Myrinet switches and file systems, could provide even more useful data to be analyzed in conjunction with the other system tests. GUI work could include any number of interfaces, including an HTML page describing the current state of the system and/or adding the ability to get status in the current cgui program.