Title: Root Cause Analysis of Errors for High Performance Computing

Speaker: Joel Vaughan, SIP, Sandia National Laboratories

Date/Time: Wednesday, September 9, 2009, 1:00 – 2:00 pm

Location: CSRI Building, Room 90 (Sandia NM)

Brief Abstract: A supercomputer is a complex network of CPUs, switches, routers, and cables. When an error occurs, such as a job halting, a message being dropped, or a corrupted message arriving, its root cause may be difficult to assess. When multiple errors occur across different jobs, users, executables, and subsets of the network, it may be possible to combine the data to gain insight into likely root causes. Currently, locating the root cause of these faults falls to system administrators, who use their expertise with a particular system to comb through the logs and determine the most likely causes. However, as supercomputers grow in size and complexity, this process will become more costly, both in the resources spent isolating the source of the faults and in the compute time lost while the failure is corrected. We present a statistical method to assist in determining the root cause of failures. The method is discussed, and real failures on a current production system are analyzed.

CSRI POC: Scott Mitchell, (505) 845-7594


