Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS)
Paper | HPCS05 Presentation | LCI05 Presentation | RAS via Informatics

Abstract
The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved and hinders their solution. This paper seeks to foster a common basis for communication about supercomputer RAS by proposing a system state model, definitions, and measurements. These are modeled after the SEMI-E10 specification, which is widely used in the semiconductor manufacturing industry.


State Hierarchy Diagram
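
To make the idea of a state model with derived measurements concrete, here is a minimal sketch in Python. It assumes the six basic equipment states generally associated with SEMI E10 (productive, standby, engineering, scheduled downtime, unscheduled downtime, non-scheduled); the state names, the `summarize` helper, and the two ratios it reports are illustrative assumptions and may differ from the definitions proposed in the paper.

```python
from enum import Enum


class State(Enum):
    # Assumed state names, following the usual SEMI E10 basic states.
    PRODUCTIVE = "productive"
    STANDBY = "standby"
    ENGINEERING = "engineering"
    SCHEDULED_DOWNTIME = "scheduled downtime"
    UNSCHEDULED_DOWNTIME = "unscheduled downtime"
    NON_SCHEDULED = "non-scheduled"


# Hierarchy: total time = operations time + non-scheduled time,
# operations time = uptime + downtime.
UPTIME = {State.PRODUCTIVE, State.STANDBY, State.ENGINEERING}
DOWNTIME = {State.SCHEDULED_DOWNTIME, State.UNSCHEDULED_DOWNTIME}


def summarize(hours):
    """Roll per-state hours up the hierarchy and derive two ratios.

    `hours` maps State -> hours spent in that state (hypothetical input).
    """
    total = sum(hours.values())
    operations = total - hours.get(State.NON_SCHEDULED, 0.0)
    if operations <= 0:
        raise ValueError("operations time must be positive")
    uptime = sum(hours.get(s, 0.0) for s in UPTIME)
    downtime = sum(hours.get(s, 0.0) for s in DOWNTIME)
    return {
        "total_hours": total,
        "operations_hours": operations,
        "uptime_hours": uptime,
        "downtime_hours": downtime,
        # Fraction of operations time the system was up.
        "uptime_fraction": uptime / operations,
        # Fraction of operations time spent doing productive work.
        "utilization": hours.get(State.PRODUCTIVE, 0.0) / operations,
    }


# Hypothetical one-week (168 hour) accounting, for illustration only.
week = {
    State.PRODUCTIVE: 120.0,
    State.STANDBY: 16.0,
    State.ENGINEERING: 8.0,
    State.SCHEDULED_DOWNTIME: 8.0,
    State.UNSCHEDULED_DOWNTIME: 8.0,
    State.NON_SCHEDULED: 8.0,
}
report = summarize(week)
```

The point of the hierarchy is that every hour is assigned to exactly one leaf state, so ratios computed at different sites become comparable once the state definitions are shared.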


Application to Red Storm:
The details necessary to apply these concepts to the Red Storm (Cray XT3) supercomputer were presented at the 2005 Cray Users Group (CUG) meeting. Here are the paper and slides (note slides 17 and 18). A follow-up report was then presented at CUG06 (paper and slides).

Contact:
Red Storm RAS metrics have continued to evolve but have not been published. External collaborations toward establishing meaningful standards for quantifying RAS performance are ongoing. Contact Jon Stearley <jrstear@sandia.gov> for further information.


© Sandia Corporation