Defining and Measuring Supercomputer Reliability, Availability, and Serviceability (RAS)
Paper | HPCS05 Presentation | LCI05 Presentation | RAS via Informatics
Abstract
The absence of agreed definitions and metrics for supercomputer RAS obscures meaningful discussion of the issues involved and hinders their solution. This paper seeks to foster a common basis for communication about supercomputer RAS by proposing a system state model, definitions, and measurements. These are modeled after the SEMI-E10 specification, which is widely used in the semiconductor manufacturing industry.
State Hierarchy Diagram
Application to Red Storm:
The details necessary to apply these concepts to the Red Storm (Cray XT3) supercomputer were presented at the 2005 Cray User Group meeting. Here are the paper and slides (note slides 17 and 18).
Contact:
Contact Jon Stearley <jrstear@sandia.gov> for further information.

© Sandia Corporation