CCIM Sandia National Laboratories
Rolf's Page
LaTeX SAND Report
Current Projects
Publications
Talks and Posters
Previous Projects
Software

Application Model

The current version is 1.005, which was used for our paper currently in submission.

New in this version are the command line option --scale, some important bug fixes, and minor tweaks in the output format.

As part of the resiliency project I wrote a small simulator that mimics an application as it goes through its work, checkpoint, restart, and rework phases. This page describes this model in more detail and tells you how to download, build and use it.

The program was initially called app_model and open source licensed under that name. Since it is neither a model, not does it model or simulate an application, I renamed the executable to two_step to reflect that we are really modeling or simulating the checkpoint/restart process: Make progress, then step back to a previous position and start moving forward again.

Overview

Long-running applications use checkpoints and restarts to span system faults and other interruptions. After a fault, an application restarts from its most recent and successful checkpoint, and continues its computation. All this adds overhead and extends the total run time beyond what is needed to complete the work.

Calculating the total run time is difficult because it depends on the number of interruptions, the amount of time between checkpoints, the duration of writing checkpoints, how faults are distributed over time, and how long it takes to restart. However, knowing the total run time is important to evaluate a system's throughput (how many applications it can process in a unit of time), to improve job scheduling, and to assess the impact of resiliency measures, such as redundant computation.

The two_step program simulates an application as it moves between the various stages of execution: working, writing checkpoints, restarting, and redoing lost work. The two_step uses a random number generator to simulate faults during an application's execution and keeps statistics about the number of faults and time spent in the various phases of execution. At the end of the simulation, two_step prints these statistics as well as the total run time of the application. The two_step program can also simulate redundant computation and calculate the overhead and time savings of that technique

Lifetime of an Application

The picture below illustrates the phases an application goes through in order to complete a certain amount of work. After it starts running, the application does some of its work and then pauses to write a checkpoint to stable storage. Once it has written a checkpoint, the application continues with its work.

The lifetime of an application

When an interrupt (application failure) occurs, the application restarts and has to redo the work that was lost since the last successful checkpoint. Making progress on the work load continues after that until it is time again to write another checkpoint, or a new interruption occurs.

Interrupts can occur during any phase of the application. If an interrupt occurs during a work phase, the work since the last successful checkpoint is wasted and needs to be redone after the application restarts.

From the above figure it is clear that an application, even if it gets interrupted rarely, will need more time to complete than the actual amount of work suggests. If interrupts are frequent, then a considerable amount of the total application time may be used for checkpoints, restarts, and rework.

Simulating an Application

The two_step program mimics the interaction of an application with the system it runs on. The state diagram below shows that the program, just like a real application, transitions from working on a problem, to check-pointing, and back to working again.

State diagram of the application model

When an interrupt occurs, either during a work or checkpoint phase, a restart from the last successful checkpoint is initiated. The work that was lost since the last checkpoint has to be redone in the rework phase. After that, the regular cycle of work and check-pointing continues.

The transitions to the checkpoint state occur whenever the checkpoint interval timer expires. That timer is reset in the checkpoint state. two_step uses Daly's equation to calculate the optimal checkpoint interval. When the amount of successfully completed work reaches the simulated workload, the simulator enters the end state and stops. At that point it outputs the amount of time spent in each state and various other statistics it accumulated during the run.

One of the parameters to control the simulator is the MTBF of the simulated system. The simulator generates random events that are exponentially distributed around the MTBF. Each event is a fault that is feed into a node model. The model determines which node has failed and whether the application receives an interrupt or can continue doing its work. If the model determines that an application interrupt should occur, it forces a transition to the restart state in the state diagram above. These transitions are indicated by the exclamation point signs in the diagram. We are assuming that for a restart, the same number of nodes will be available again. This is the same assumption that is made for applications using checkpoint/restart running without redundant nodes.

The MTBF parameter for the application simulator is the MTBF of a single node. The fault generator within the simulator generates faults for the individual nodes. Therefore, we expect the overall MTBF, the system MTBF, to be much smaller than the one for an individual node. The block diagram of the simulator is shown below. The state machine requests the next time an application interrupt will occur from the rMPI model. (That model resents a redundant MPI library we have written that allows for redundant nodes.) The fault generator generates exponentially distributed faults for each node. The rMPI model then determines at what time to cause an application interrupt. It will be the earliest time both nodes in a bundle have failed. The application interrupt times are fed into a analysis module which computes the system MTBF and prints various statistics about the run.

Block diagram of the application simulator.

Compiling

The included Makefile should build the code on most Unix systems. It has been tested on several Linux versions. The code requires the GNU Scientific Library (gsl). Which is not always installed by default. You can obtain it from www.gnu.org/software/gsl, if necessary.

The Makefile creates an executable name two_step (as in two steps forward, one step back, mimicking an application's march towards completion ;-)

Usage

The program understand the options below. None of them are mandatory. Long options; e.g., --soft_reboot, can be truncated as long as they are still distinct from all other options; e.g., --soft.

Short Option Description
-w --work_time HOURS The application will complete work for that many hours on each node (weak scaling). Default is 168 hours (one week). The elapsed time of the application will be higher due to checkpoint/restart overhead and application interrupts.
-n --num_bundles NUM Number of nodes the application runs on. Default is 512 nodes. These are the nodes an application "sees", and does not include any redundant nodes that would not be visible to the application.
-r --redundant NUM Number of redundant nodes: 0...N, where N is the number of bundles (-n) selected. 0 means no redundant computing at all, and N means each node in the system has a redundant partner. Values in between are acceptable if an application is run in partial redundant mode. Default is no redundant nodes (0).
-v --verbose This option may be repeated for increased verbosity. Mostly useful for debugging and to observe the inner workings of the program.
--distribution DIST Selects the random distribution function for the fault generator. DIST can be exp (default), gamma, or weibull.
--scale a Set the scale parameter for the gamma and weibull distribution. The default is 43800.000 hours.
--shape b Set the shape parameter for the gamma and weibull distribution. The default shape parameter b is 0.5 and must be > 0.
-s --seed Use a fixed seed for the random number generator. This is useful to repeat experiments with the same start conditions. Without this option, a random seed based on the current time and PID of the process is used, which results in different results for each run.
-c --checkpoint_time MINUTES Amount of time needed to checkpoint an application. Default is 5 minutes. This is currently fixed, although it should vary with application size and the I/O characteristics of the system.
-R --restart_time MINUTES Time to restart and application. Default is 10 minutes. The program currently assumes that the application is restarted right away after an interrupt and spends this many minutes reading in the last successful checkpoint. Wait time in a batch queue is not considered here. After this restart time, any work lost since the previous checkpoint is redone before the regular work phase can be entered again. This is currently fixed, although it should vary with application size and the I/O characteristics of the system.
-t --tau MINUTES Checkpoint interval; i.e., the elapsed time before the next checkpoint is written. Unless specified, this time is computed using Daly's equation for the optimum checkpoint interval.
-m --mtbf_node HOURS The Mean Time Between Failures (MTBF) of a single node. Default is 5 years (43800 hours). It is assumed that the MTBF and failure characteristics is the same for all nodes.
--mtbf_sys HOURS The system MTBF is calculated by default and used for comparison at the end of the simulation. This is the MTBF of the system; i.e., the time between faults. The application may have a larger MTBF, if redundant nodes are used.
-a --mtbf_app HOURS By default the application MTBF is calculated based on the node MTBF. It is used to calculate the optimal checkpoint interval. This is the time between interrupts the application experiences, which may be different from the system MTBF, if redundant nodes are used. If provided, it will override node MTBF for that calculation. The node MTBF will still be used for the random fault generator.
-d --delay_ras MINUTES Multiple faults within this time count as one. Sometimes multiple components fail at nearly the same time, or cause other failures. This option allows the cumulation of all faults within this time interval into a single event. Individual faults are still reported at the end, but the application does not go through a full restart/rework/interrupt cycle for each one of them.
--finterrupts FILENAME For each application interrupt write the interrupt time (in hours since application start) and the number of faults since the last interrupt. If no redundant nodes are present, each fault leads to an interrupt. Specifying "-" directs output to stdout.
--ffaults FILENAME For each node fault write the fault time (in hours since application start) to the specified file. Specifying "-" directs output to stdout.
--soft_reboot <success>,<reboot time> By default failed nodes are not reused. With this option it is possible to reboot nodes after each fault. The probability that they reboot successfully and can be reintegrated into the computation can be specified in the range 0%...100%. It takes a certain number of minutes for a node to reboot. Both values must be specified. E.g., --soft_reboot 50,10 specifies that on average half of the nodes can be reintegrated (50%), and it takes 10 minutes for each node to boot. If the other node receives a fault during those 10 minutes, the bundle, and the application will fail. After the 10 minutes and a successful re-integration, redundancy is restored.
--help Short information about the command line options.
--input FILENAME Instead of letting a random number generator create node faults, application interrupts can be read from an input file. The file should have three values per line: The time, in seconds from application start, when the interrupt occurs; the node that causes the interrupt; and a string describing the error. The fields are separated by white space.

The node number is checked but not further used. The fault description string must be present, but is ignored. This format was chosen to make it easy to process fault logs that can be found on the Internet. They may need to be pre-processed, but can be easily converted into the format required here.

-p --performance Display performance data about the simulation itself.

Output

A typical run and its output are shown below. The line by line explanation follows.

    00	./two_step -n 100000 -r 100000 -w 720 -p
    01		Version 1.005
    02		Command line "./two_step -n 100000 -r 100000 -w 720 -p"
    03		PARAMETERS
    04		  Active nodes                 100000
    05		  Redundant nodes              100000
    06		  Total nodes                  200000
    07		  Checkpoint duration            5.00 minutes
    08		  Restart duration              10.00 minutes
    09		  Work to be done              720.00 hours
    10		  Node MTBF                  43800.00 hours
    11		  File for interrupt times            ""
    12		  File for fault times                ""
    13		  Seed for pseudo random generator    random
    14		  Fault distribution:                 exponential
    15		  RAS delay                      0.00 minutes
    16		  Soft reboot time                    not used
    17		
    18		CALCULATED
    19		  System MTBF                    0.22 hours (13.140 minutes)
    20		  Application MTBI             122.90 hours (7373.718 minutes)
    21		  Checkpoint interval            4.47 hours (268.223 minutes)
    22		  Faults/interrupt             561.17
    23		WARNING: checkpoint + restart time > system MTBF! This may take quite a while.
    24		
    25		SIMULATION
    26		  Application completed        720.00 hours of work (100.00% of work to be done)
    27		  Elapsed time                 740.25 hours (Overhead is  2.81%)
    28		  Total restart time             0.67 hours      (  0.09%)
    29		  Total rework time              6.17 hours      (  0.83%)
    30		  Total work time              720.00 hours      ( 97.26%)
    31		  Total checkpoint time         13.42 hours      (  1.81%)
    32		  Total RAS delay                0.00 hours      (  0.00%)
    33		    -----------------------------------------------------
    34		    Totals                     740.25 hours      (100.00%)
    35		
    36		  Number of restarts:            4    Failed:          0
    37		  Number of rework:              4    Failed:          0
    38		  Number of work segments:     162    Failed:          4
    39		  Number of checkpoints:       161    Failed:          0
    40		    ----------------------------------------------------
    41		                                      Fails:           4
    42		                                      Interrupts:      4
    43		
    44		  Faults:                     3364
    45		  Failed nodes:               3364    Repaired:     3088 (276 nodes to be repaired after app completion)
    46		  Successful soft reboots:       0    Failed:          0 (0.00%)
    47		  Avg faults per int:      841.000,   49.87% over calculated 561.17
    48		  System MTBF                    0.22 hours (13.203 minutes), 0.48% over calculated 0.22 hours
    49		  App. MTBI                    185.06 hours (11103.783 minutes), 50.59% over calculated 122.90 hours
    50		  Modeled elapsed time         748.19 hours (44891.581 minutes), 1.07%, over simulated 740.25 hours
    51		
    52		PROGRAM PERFORMANCE INFORMATION:
    53		  Generated 203092 random numbers and 0 random probabilities
    54		  Calls to rMPI() 5
    55		  Read 0 faults from input file, accepted 0 (0.00%)
    56		  Time to model this application:  0h:00m:0.015353
		    

Line by Line Description.

Line 00
The executable is called two_step. No command line options are mandatory.
Line 01
Version of the executable.
Line 02
Capture the command line given.
Line 03
Starting here is a list of parameters. These are either defaults or were specified on the command line.
Line 04
The number of bundles being simulated.
Line 05
The total number of redundant nodes, in addition to the active nodes in each bundle.
Line 06
The total number of nodes: active nodes + redundant nodes.
Line 07
The time necessary to write a checkpoint.
Line 08
The time necessary to restart an application.
Line 09
The amount of work to be done (on each node).
Line 10
The node MTBF.
Line 11
File name to write interrupt times and number of faults. "" means no output will be generated.
Line 12
File name to write fault times. "" means no output will be generated.
Line 13
Seed used for pseudo random generator. Either "random" or "fixed".
Line 14
Random number distribution. Exponential, Gamma, or Weibull.
Line 15
Time interval within which all faults are considered to be simultaneous. If > 0 avoids an application restart for each interrupt, if they are only a few seconds apart.
Line 16
Time to reboot a node after a fault and probability of a successful re-integration.
Line 18
Calculated values from the parameters known so far.
Line 19
Estimated system MTBF; i.e., time between node faults.
Line 20
Estimated application MTBF; i.e., time between application interrupts.
Line 21
Calculated (optimal) checkpoint interval, based on Daly's equation.
Line 22
Estimated faults per application interrupt.
Line 23
A (false in in this case) warning that the simulation may require longer than usual.
Line 25
Following are the observed values after the simulation has completed.
Line 26
The amount of work completed. Should always be 100%.
Line 27
The elapsed time, which includes work, checkpoints, restarts, and redoing of lost work.
Line 28
Time and percentage spent in restart phase.
Line 29
Time and percentage spent in rework (lost work) phase.
Line 30
Time and percentage spent in work phase. Should be number of work hours specified on command line (or default).
Line 31
Time and percentage spent check-pointing.
Line 32
Time and percentage spent waiting for fault bursts to pass (-d option). (Not well tested.)
Line 34
Total should be equal to elapsed time and 100%.
Line 36
Number of restarts and how many of those failed.
Line 37
Number of rework phases and how many of those failed.
Line 38
Number of work phases and how many of those failed.
Line 39
Number of checkpoints and how many of those failed.
Line 41
The total of failed phases listed above. This and the next line below it should always match.
Line 42
Number of interrupts the application experienced.
Line 44
Total number of node faults the system experienced.
Line 45
How many nodes failed during the entire simulation. Should match the number of faults seen above. The number of repaired nodes is the same, if no redundant nodes are present. It might be less with redundant nodes, since the application was running with some broken nodes that can be repaired after the run completes.
Line 46
Statistics about rebooting nodes and re-integration.
Line 47
Observed and calculated number of faults per interrupt. For non-redundant runs, this should always be 1. The large mismatch between calculated and simulation is due to the small number of total interrupts: only four in this example.
Line 48
Observed system MTBF and comparison to estimate above.
Line 49
Observed application MTBF and comparison to estimate above. Again, off by quite a bit due to the low number of interrupts.
Line 50
Calculated elapsed time using Daly's model and comparison to simulation run.
Line 52
Some performance data about the simulation itself. Triggered by the -p option.
Line 53
Number random numbers generated (for faults) and number of random probabilities generated (for soft reboot success).
Line 54
Number of calls to the rMPI() function.
Line 55
How many faults (application interrupts) were read from the input file.
Line 56
Wall-clock time of this simulation run.

Download

The source code has undergone copyright assertion and has been released as open source and can be downloaded from www.cs.sandia.gov/web1400/1400_download.html.

Please, if you don't mind, drop me a note if you find this program useful, have suggestions for improvements, or use results from it in a project or publication. Thanks.

Papers

The Sandia Technical Report 2009-6753 describes two_step (under the old name app_model) and presents data generated from it. For example, the picture below shows how much time a 168-hour application spends in each phase, based on the number of nodes it runs on. The more nodes in use, the more likely that one of them will fail and cause an application restart.

Time in each phase of a 168-hour application run.

Top of page

Image placeholder

Page Contact

Rolf Riesen, Ph.D.
Scalable Computing Systems
Sandia National Laboratories
Org. 1423, MS 1319
Albuquerque, NM 87185-1319

email: rolf.riesen@gmail.com
CV: cv.pdf


Related Links

My LinkedIn profile.

Web page at UNM.

Personal web page.