Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS)
Paper | HPCS05 Presentation | LCI05 Presentation | RAS via Informatics
Abstract
The absence of agreed definitions and metrics for supercomputer RAS
obscures meaningful discussion of the issues involved and hinders their
solution. This paper seeks to foster a common basis for communication
about supercomputer RAS by proposing a system state model,
definitions, and measurements. These are modeled after the SEMI-E10
specification, which is widely used in the semiconductor manufacturing
industry.
State Hierarchy Diagram
Application to Red Storm:
The details necessary to apply these concepts to the Red Storm (Cray XT3)
supercomputer were presented at the 2005 Cray User Group (CUG) meeting;
see the paper and slides (note slides 17 and 18). A follow-up report was
presented at CUG06 (paper and slides).
Contact:
Red Storm RAS metrics have continued to evolve but have not been published. External collaborations toward establishing meaningful standards for quantifying RAS performance are ongoing. Contact Jon Stearley <jrstear@sandia.gov> for further information.

© Sandia Corporation