CSRI Workshop on Fault Tolerance
Sandia National Laboratories CA site, Livermore, CA
April 26-27, 2001

 
Large-scale parallel applications that run for long periods of time on hundreds or even thousands of processors have become increasingly common within DOE. Because of the large number of processors and the complexity of the codes involved, it is reasonable to expect at least one failure to occur during any single run. These applications have not been equipped to deal with failures efficiently, so when one occurs, the application crashes. While most programmers use checkpoints so that their applications can be restarted, this approach is cumbersome and incurs substantial overhead, especially as the number of processors and the size of the problem become large. Thus, more efficient and more sophisticated strategies for fault tolerance are required.

In response to the growing interest in scalable techniques for fault tolerance, the CSRI is hosting a workshop at Sandia National Laboratories’ California site. The workshop will feature a series of talks by experts in fault tolerance, as well as talks on Sandia’s applications, computing platforms, and requirements for fault tolerance. The goals of the workshop are the following:

  • to encourage external experts to work on Sandia problems,
  • to raise Sandia awareness of available technology,
  • to define a list of short-term tasks for delivering fault tolerance technology to the Sandia user community, and
  • to generate a list of long-term goals for addressing fault tolerance on various ASCI and SciDAC computing platforms.

For more information, contact Patty Hough at pdhough@sandia.gov.