tsunami_check - parse the log files for machine check errors




tsunami_check [ --help ] <logfile>


This script parses a syslog file (or files) for Alpha node "TSUNAMI machine check" errors. For each node found to have errors, it prints a line in the following format:

  <num_errors> - <nodename> <last matching message>

If any of the errors are something other than processor-correctable ECC errors, <num_errors> is reported as:

  <total_errors> (<critical_errors>)

If tsunami_check does not return anything, then no errors were found in the specified syslog.
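If the output needs to be fed to another tool, the line format described above can be split with a regular expression. The following is only a minimal sketch in Python: parse_line and LINE_RE are hypothetical names that are not part of tsunami_check, and the sketch assumes that the node name contains no whitespace and that the count in parentheses is optional.

```python
import re

# Matches "<total> [(<critical>)] - <nodename> <last matching message>".
# Assumption: the node name contains no whitespace.
LINE_RE = re.compile(
    r"^\s*(?P<total>\d+)"              # total error count
    r"(?:\s*\((?P<critical>\d+)\))?"   # optional critical count in parentheses
    r"\s+-\s+(?P<node>\S+)"            # separator and node name
    r"\s+(?P<message>.*)$"             # last matching syslog message
)

def parse_line(line):
    """Return a dict for one tsunami_check output line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    return {
        "total": int(m.group("total")),
        "critical": int(m.group("critical") or 0),  # 0 when no "(n)" is present
        "node": m.group("node"),
        "message": m.group("message"),
    }
```

Lines that do not fit the format yield None, so blank lines or unrelated text can be skipped by the caller.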

To scan multiple files or directories, use shell wildcards and surround the whole path argument in quotes. For example:

  tsunami_check "/var/log/syslog-ng/computes/*"
  tsunami_check "/var/log/syslog-ng/*/2001.10.12"
  tsunami_check "/var/log/syslog/node1 /var/log/syslog/node2"


--help Print this manpage.


What to do with the results of this script will vary depending on your site's requirements (and your system vendor's RMA policy!), but here are some rough guidelines to follow:

- If nodes report 0x670, 0x680 or any other TSUNAMI errors, they should be flagged as bad right away and replaced as soon as possible. For example, node n-1.g-7 in the sample output below had 3 such errors.

- For initial hardware testing, try to get rid of all the memory that causes more than 20 0x630 errors after an hour of memory stress testing. (See the node_hw_test documentation.)

- In production, it is a little more difficult because you don't know exactly what has been running on the system. More than 500 0x630 errors in a day is a sign of a problem. If you see more than 100 0x630 errors per day and the node ever hangs, fails to load a job, has high Myrinet error counts, or exhibits other problems, then flag it as a problem as well.

- Finally, it is recommended that you watch the questionable nodes (n-7.g-6 and n-6.g-6 in the example below) over several days. A one-time spike might be caused by a bumped rack, solar flares, an AC malfunction, or other environmental factors. It's the nodes that are consistently bad and gradually getting worse that most likely indicate a pending critical failure and should be dealt with.
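The production guidelines above can be sketched as a small decision function. This is only an illustration in Python, not part of tsunami_check: the function name triage, the has_other_symptoms flag, and the "watch" label for everything below the thresholds are inventions of this sketch. It also assumes that the counts cover one day and that <total_errors> includes the errors counted in parentheses.

```python
# Thresholds taken from the guidelines above (0x630 errors per day).
PROBLEM_THRESHOLD = 500   # flag the node on the error count alone
SYMPTOM_THRESHOLD = 100   # flag the node only if it also misbehaves

def triage(total, critical=0, has_other_symptoms=False):
    """Classify one node's daily counts as 'replace', 'problem', or 'watch'."""
    # Assumption: <total_errors> includes the critical errors.
    correctable = total - critical
    if critical > 0:
        # Any error other than a correctable 0x630: replace as soon as possible.
        return "replace"
    if correctable > PROBLEM_THRESHOLD:
        return "problem"
    if correctable > SYMPTOM_THRESHOLD and has_other_symptoms:
        # Hangs, failed job loads, high Myrinet error counts, and so on.
        return "problem"
    # Only nodes with errors are reported at all, so keep an eye on the rest.
    return "watch"

print(triage(171, 3))                        # replace
print(triage(1971))                          # problem
print(triage(306))                           # watch
print(triage(306, has_other_symptoms=True))  # problem
```

The thresholds are kept as named constants so that a site can adjust them to its own requirements and vendor RMA policy.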


  1 - if-0.n-9.g-26 Nov 14 15:09:42 TSUNAMI machine check: vector=0x630 pc=0xfffffc000032f310 code=0x100000086
  13 - if-0.n-6.g-6 Nov 14 13:07:09 TSUNAMI machine check: vector=0x630 pc=0xfffffc0000406710 code=0x100000086
  306 - if-0.n-7.g-6 Nov 14 13:19:29 TSUNAMI machine check: vector=0x630 pc=0xfffffc000032f310 code=0x100000086
  3 - if-0.n-0.g-6 Nov 14 15:20:20 TSUNAMI machine check: vector=0x630 pc=0xfffffc0000406708 code=0x100000086
  1 - if-0.n-2.g-4 Nov 14 11:09:13 TSUNAMI machine check: vector=0x630 pc=0xfffffc0000406708 code=0x100000086
  1971 - if-0.n-3.g-4 Nov 14 17:03:22 TSUNAMI machine check: vector=0x630 pc=0x2000000d27c code=0x100000086
  171 (3) - if-0.n-1.g-7 Nov 14 13:19:13 TSUNAMI machine check: vector=0x630 pc=0xfffffc00004043b0 code=0x100000086


Syslogs on CIT systems are usually stored in /var/log/syslog-ng.


node_hw_test, node_hw_analyze, and the 'diag' module documentation.