Supercomputer Event Logs

If we hope to automatically detect and diagnose failures in large-scale computer systems, we must study real deployed systems and the data they generate. Progress has been hampered by the inaccessibility of empirical data. This site addresses that dearth by providing system logs from five supercomputers.  By distributing these logs we hope to facilitate reproducible research in log analysis, specifically toward increasing supercomputer reliability, availability, and serviceability (RAS).

A preliminary analysis of these logs can be found in:

Adam Oliner and Jon Stearley. What Supercomputers Say - An Analysis of Five System Logs. IEEE/IFIP Conference on Dependable Systems and Networks (DSN), 2007. [paper, presentation]

System         Start Date  Days  Size (GB)  Compressed (GB)  Rate (bytes/sec)  Messages     Alerts       Alert Categories
Blue Gene/L    2005-06-03  215   1.207      0.118            64.976            4,747,963    348,460      41
Thunderbird    2005-11-09  244   27.367     5.721            1298.146          211,212,192  3,248,239    10
Red Storm      2006-03-19  104   29.990     1.215            3337.562          219,096,168  1,665,744    12
Spirit (ICC2)  2005-01-01  558   30.289     1.678            628.257           272,298,969  172,816,564  8
Liberty        2004-12-12  315   22.820     0.622            835.824           265,569,231  2,452        6
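
The Rate column appears to be the uncompressed size divided by the length of the collection period. The sketch below checks that reading against the table. It assumes 1 GB = 10**9 bytes and a full 86,400 seconds per listed day; neither assumption is stated on this page, and the function names are illustrative only. Because the listed sizes are rounded, the computed values only approximate the listed rates.

```python
# Assumption (not stated on this page): 1 GB = 10**9 bytes, and the rate is
# averaged over the full collection period of `days` whole days.
SECONDS_PER_DAY = 86_400

# (system, days, size in GB, listed rate in bytes/sec), copied from the table.
ROWS = [
    ("Blue Gene/L", 215, 1.207, 64.976),
    ("Thunderbird", 244, 27.367, 1298.146),
    ("Red Storm", 104, 29.990, 3337.562),
    ("Spirit (ICC2)", 558, 30.289, 628.257),
    ("Liberty", 315, 22.820, 835.824),
]


def average_rate(size_gb, days):
    """Average log volume in bytes per second over the collection period."""
    return size_gb * 1e9 / (days * SECONDS_PER_DAY)


def rate_report(rows=ROWS):
    """Return (system, computed rate, relative difference from listed rate)."""
    report = []
    for system, days, size_gb, listed in rows:
        computed = average_rate(size_gb, days)
        report.append((system, computed, abs(computed - listed) / listed))
    return report


if __name__ == "__main__":
    for system, computed, rel_diff in rate_report():
        print(f"{system:14s} {computed:10.3f} bytes/sec "
              f"(differs from listed value by {rel_diff:.2%})")
```

Under these assumptions the computed rates agree with the listed ones to within one percent for all five systems.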

See this README for md5sums, format description, and other useful information.
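
Since several of the logs are tens of gigabytes, it is worth checking a download against the published md5sum without loading the whole file into memory. The following is a minimal sketch using Python's standard hashlib module. The function names are hypothetical, and the file name and checksum must be taken from the README; none are reproduced here.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_md5):
    """Compare a file's digest with a checksum string copied from the README."""
    return md5_of_file(path) == published_md5.strip().lower()
```

A call such as matches_published(path, checksum) returns True only when the downloaded file is byte-for-byte identical to the one the checksum was computed from.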

Further analysis has revealed additional alert types, as well as a small number of incorrectly tagged lines. The updated tagging can be reproduced using the tools here. A detailed description of the revision and an analysis of the resulting logs are given in "Alert Detection in System Logs", IEEE International Conference on Data Mining (ICDM), 2008.

These and a wide variety of other system logs are available at http://cfdr.usenix.org/.

Defining Supercomputer RAS | The Sisyphus Log Analysis Toolkit
