Name


yod -- Load a Cplant parallel application

Synopsis


yod {yod-options} program-path-name {program-arguments}

OR

yod {yod-options} load-file-name

YOD-OPTIONS:

[-alloc] [-attach] [-bt] [-D] [-d info-type] [-file file-name]

[-help | -vhelp] [-interactive | -batch] [-kill] [-list node-list]

[-Log] [-nid node-number -pid portal-id] [-NOBUF] [-quiet] [-show] [-sleep where]

[-strace path-name] [-straceoptions option-list] [-stracenodes rank-list]

[-sz nodes] [-timing]

Description


yod is a utility that loads a parallel application onto a set of compute nodes. File operations performed by the compute node processes (if not directed to a parallel IO facility) are transparently forwarded to yod which executes the operations and returns the results to the application. yod exits when each member of the parallel application has exited.

Here is a typical use of yod. It loads myCode on 64 nodes, and passes the command line argument -i input.dat to each process of the parallel program.

yod -sz 64 myCode -i input.dat

The program-arguments, along with your environment, will be sent along to the compute node processes. The standard input of yod is the standard input of the compute node processes. The standard input is not duplicated, so if node 0 reads some bytes from standard input, the next read of standard input from any node in the parallel application will get the next bytes in the stream.

It is possible to send a SIGUSR1 or SIGUSR2 to a parallel application by sending the signal to yod. yod will forward the signal to the user application processes. (Type kill -s SIGUSR1 yod-pid on the node running yod to send the application processes a SIGUSR1.)

Interrupting yod with control-c causes it to interrupt the application processes with a SIGTERM. yod will await completion messages from the compute nodes. If yod seems stuck, interrupt with control-c again. This will cause yod to interrupt the application processes with a SIGKILL. If yod still seems stuck, interrupt with control-c a third time. yod will simply reset the compute nodes and exit.

An alternative to killing a job through yod is to run pingd -reset -mine to reset the compute nodes hosting your application. Your application processes will be sent a SIGKILL, and the compute nodes released for other users. You may use the command pingd -interrupt -mine to send a SIGTERM to all of your parallel applications. See the pingd man page for other ways to specify nodes or jobs for the command to act upon.

When loading a single executable file onto the compute partition, list the executable path name followed by your program arguments on the yod command line. To load more than one executable file, or to specify different command line arguments to different processes (a heterogeneous load), specify the command lines in process rank order in a load file. List the load file name as the argument to yod.

Load File Format


Your load file is a text file you create with your favorite text editor. It has two kinds of entries: comments and application members. Comments are lines on which the first text that appears is a pound sign (#). These are ignored by yod. The other type of entry lists a member of the parallel application and has this format:

{yod-options} program-path-name {program-arguments}

The only yod options accepted in a load file are -sz and -list.

Example:

yod -l 100..200 myLoadFile

The contents of myLoadFile are listed here:

#
# load file to run my computation and parallel vis server
#

-sz 2 -l 500,501 my-vis-code bufsize=2048
-sz 64 my-computational-code

In this example, the executable file my-vis-code will be loaded on nodes 500 and 501, will be passed the argument bufsize=2048, and will be ranks 0 and 1 in the parallel application. The executable file my-computational-code will be loaded on 64 free nodes found in the node number list 100 through 200. These processes will have ranks 2 through 65 in the parallel application. MPI users note that the 66 processes described will populate a single MPI_COMM_WORLD on application start up.

If a load file is provided, any size argument given on the yod command line is ignored. If there is no node list given in the load file for a member, then the node list given on the yod command line will be used. If in addition there is no node list given on the yod command line, then the requested nodes will be allocated from anywhere among the general collection of free nodes. If there is no size argument provided in the load file, but a node list is provided, it will be assumed that you want all the nodes in the node list. If there is no size argument provided in the load file and also no node list, it will be assumed that you want one node from anywhere.
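
The rank and size rules above can be sketched as a small shell script. This is an illustration only, not yod's own parser: the function names count_nodes and load_file_ranks are hypothetical, and the sketch assumes well-formed member lines whose options come before the program path name. It reports which ranks each member would occupy; it does not check which nodes are free.

```shell
# Sketch only: mirrors the load file rules described above; not yod's parser.

# Count the nodes named by a node list such as "500,501" or "100..200".
count_nodes() {
    total=0
    old_ifs=$IFS
    IFS=,
    for spec in $1; do
        case $spec in
            *.*)
                first=${spec%%.*}   # text before the first dot
                last=${spec##*.}    # text after the last dot
                total=$((total + last - first + 1))
                ;;
            *)
                total=$((total + 1))
                ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Print "first-rank last-rank program" for each member of a load file.
# $1 is the load file; $2 is the optional node list from the yod command line.
load_file_ranks() {
    file=$1
    cmd_list=${2-}
    next_rank=0
    while read -r line; do
        case $line in
            '' | '#'*) continue ;;      # skip blank lines and comments
        esac
        set -- $line                    # split the member line into words
        size=
        list=
        while [ $# -gt 0 ]; do
            case $1 in
                -sz) size=$2; shift 2 ;;
                -l | -list) list=$2; shift 2 ;;
                *) break ;;             # first non-option word is the program
            esac
        done
        # No node list for this member: fall back to the command line list.
        [ -n "$list" ] || list=$cmd_list
        # No size: every node in the list, or one node if there is no list.
        if [ -z "$size" ]; then
            if [ -n "$list" ]; then
                size=$(count_nodes "$list")
            else
                size=1
            fi
        fi
        echo "$next_rank $((next_rank + size - 1)) $1"
        next_rank=$((next_rank + size))
    done < "$file"
}
```

Run against the example load file as load_file_ranks myLoadFile 100..200, the sketch reports ranks 0 through 1 for my-vis-code and ranks 2 through 65 for my-computational-code, matching the description above.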

Yod Options


-alloc
Choosing -alloc was useful for compute node debugging before the availability of cgdb or Totalview. It displays the nodes on which your application has been started and waits for you to press a key before allowing the processes in your parallel application to proceed out of system code and into user code. You could at this point log in to a compute node and attach a debugger to your application to catch it before it proceeds to main. Since users are discouraged from logging into compute nodes, it would be better for you to use -attach and cgdb. Also see the -bt option of yod.

-attach
This option is essentially the same as -alloc. It is intended to hold the application processes once they have started executing at an instruction prior to user code (prior to main). You can at this point start cgdb to attach a debugger to a process. See the cgdb man page for more help on debugging compute node processes.

-batch
This option informs yod that it is not being run interactively. In this case, yod will not wait for user responses in certain circumstances. For example, if one of your application processes terminates abnormally (with a non-zero exit code or as the result of a signal), yod will automatically kill your parallel application for you. Normally your application is not killed if some processes are still running. The default is that you are not running in batch mode. See -interactive.

-bt
This option will cause yod to display a stack trace for user processes that terminate abnormally. yod normally displays a one-line completion message for each process in your parallel application, listing the exit code or terminating signal if any. If the completion message indicates that your application process terminated with a signal and you wish to investigate, you may rebuild your application with debugging symbols and re-run it with the -bt option of yod. The PCT will then attach a debugger to your process, collect the stack trace when it faults, and send the stack trace to yod for display.

-D
Turn on debugging of the application load. The steps in the load protocol are displayed as the application load progresses. Application process file IO requests are displayed as yod receives them.

-file file-name
When all processes in the parallel application have completed, yod displays a one line completion message for each process. This message lists the wall-clock time elapsed from start to finish for the process, and the exit code and terminating signal, if any, for the process. By default the listing goes to stdout, but may be redirected to a file with this option.

-help
-vhelp
The -help option displays a usage message for yod; -vhelp displays a more verbose message.

-interactive
This option informs yod that it is being run interactively by a living user. This is the default mode. If yod is being run by a script, be certain to specify -batch on the command line. One difference between interactive mode and batch mode is that if the load fails on one node, interactive mode waits for the user to interrupt yod with control-c before cancelling the load on all allocated nodes. Batch mode goes ahead and cancels the load.

-kill
When yod is run in interactive mode (the default) and a process of a parallel application terminates abnormally, yod displays the fact that the process terminated but does not kill the other processes in the job. The user may choose to abort the job by terminating yod with control-c. If the user wishes yod to automatically kill the application when one or more processes terminate abnormally, then use the -kill option to yod.

-list node-list
If a node-list is provided on the yod command line, then the nodes requested will be allocated out of this list. If -sz n is specified as well, then n nodes will be allocated out of the list. If there are not n free nodes in the list, yod will display an error message. If no -sz option is specified, yod will assume you want all the nodes in the node-list. A node-list is a list of node specifiers separated by commas. A node specifier is a physical node number or a node range. A node range is specified by two physical node numbers separated by one or more dots. No white space may be included in the node-list. Example: -l 25..35,112..140,160,165
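
To make the node-list syntax concrete, here is a minimal shell sketch that expands a node-list into individual physical node numbers. The function name expand_node_list is hypothetical and is not part of yod; the sketch assumes a well-formed list with no white space and ascending ranges.

```shell
# Sketch only: expands the node-list syntax described above; not part of yod.
# Prints one physical node number per line.
expand_node_list() {
    old_ifs=$IFS
    IFS=,                           # node specifiers are separated by commas
    for spec in $1; do
        case $spec in
            *.*)
                n=${spec%%.*}       # first node: text before the first dot
                last=${spec##*.}    # last node: text after the last dot
                while [ "$n" -le "$last" ]; do
                    echo "$n"
                    n=$((n + 1))
                done
                ;;
            *)
                echo "$spec"        # a single physical node number
                ;;
        esac
    done
    IFS=$old_ifs
}
```

For the example list above, expand_node_list 25..35,112..140,160,165 prints 42 node numbers, which is the number of nodes yod would assume you want if no -sz option were given.
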

-Log
This option causes the compute node application load protocol steps to be logged to /var/log/cplant on the compute node. It is intended for use by Cplant system debuggers.

-NOBUF
yod displays its own messages and also text printed by the parallel application processes while they are running. Normally this combination of buffered (yod's status messages) and unbuffered (application output and yod's error messages) messages appears sensibly on the tty that started yod. But if yod was started by an rsh from a remote node, the output appears garbled. The -NOBUF option solves this problem by making all yod output unbuffered.

-nid node-number
-pid portal-ID
These arguments will cause yod to contact the bebopd on the specified node number and at the specified portal ID rather than the bebopd listed in the cplant-host file. This option is only for testing alternative bebopds and should be used only by Cplant developers.

-quiet
yod, like this man page, is quite verbose. It lists many status and error messages as it loads and runs a parallel application. If you wish to have these messages suppressed, run yod with the -quiet option.

-show
Cplant parallel applications are encoded with a version string. yod will not load an application encoded with the wrong version string (unless you run yod with the secret -xxx option). The -show option lists the correct version string and the version string found in your executable.

-sleep where
Cplant system debuggers may want to attach a debugger to a Cplant application before it is in user code. This option provides four different points at which the processes can be held for 60 seconds. The options are -sleep 1 (right after the fork), -sleep 2 (just before the exec), -sleep 3 (right after entering system startup code), and -sleep 4 (just before proceeding to main).

-strace path-name
Yet another debugging tool. path-name should be a directory which is mounted writable on the compute node. This option will cause the PCT to run the application process under strace which will list all system calls (and their arguments) made by the application process. By default, only the rank 0 process is traced. The strace output goes to a file in directory path-name. The file name contains the Cplant job ID and the rank of the process being traced.

-straceoptions option-list
The PCT will invoke strace with the options you specify in the quoted string option-list. You must use the -strace option with this option.

-stracenodes rank-list
The PCT will invoke strace on the processes with the ranks given in the rank-list. The format for the rank-list is the same as the format for a node list. By default, strace is invoked only on the rank 0 process. You must use the -strace option with this option.

-sz nodes
The number of compute nodes required to run the parallel application. One member (process) of the application will run on each node. The default if no node list is specified is -sz 1. The default if a node list is specified is the number of nodes in the node list.

-timing
Interested in how long the different stages of application load are taking? The -timing option times them and displays the results in seconds. (If our name was mpirun instead of yod we would display it in minutes!)

Return Values


yod returns 0 if the parallel application terminated normally, 1 if the application ran and terminated abnormally, and 2 if the application load failed and the application never started. Abnormal termination occurs if one or more of the processes of the parallel application exited with a non-zero exit code, or was terminated by a signal.
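
A script that runs yod may want to branch on these return values. The following is a minimal sketch; the function name describe_yod_status is hypothetical, and the message text is illustrative rather than anything yod itself prints.

```shell
# Sketch only: maps the return values listed above to a short description.
# $1 is the exit status of a yod run (for example, the value of $?).
describe_yod_status() {
    case $1 in
        0) echo "application terminated normally" ;;
        1) echo "application ran and terminated abnormally" ;;
        2) echo "application load failed; application never started" ;;
        *) echo "status $1 is not one of the documented return values" ;;
    esac
}
```

In a job script, this could follow the yod command directly:

yod -batch -sz 64 myCode -i input.dat
describe_yod_status $?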

Environment Variables


Environment variables that affect yod's behavior are described here.

Occasionally a load will fail because a compute node allocated to your parallel application is not working. yod will try to obtain a new set of nodes and load again. It will try up to three times. If you want to decrease or increase the number of retries, set the value of the environment variable YODRETRYCOUNT to the number of times yod should retry the load.

If you do not specify the full path of the executable name, yod will search first for the executable in the current working directory. If it is not found, yod will use the PATH variable in your environment to search for the executable.
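
The search order can be approximated in shell to predict which file yod would pick up. This is only a sketch of the order described above: the function name resolve_program is hypothetical, and treating any name that contains a slash as an explicit path is an assumption, not documented yod behavior.

```shell
# Sketch only: approximates the executable search order described above.
# Prints the path that would be chosen; returns 1 if nothing is found.
resolve_program() {
    prog=$1
    case $prog in
        */*)
            # Assumption: a name containing a slash is used as given.
            echo "$prog"
            return 0
            ;;
    esac
    # First choice: the current working directory.
    if [ -f "./$prog" ] && [ -x "./$prog" ]; then
        echo "$PWD/$prog"
        return 0
    fi
    # Otherwise: each directory named in PATH, in order.
    old_ifs=$IFS
    IFS=:
    set -- $PATH
    IFS=$old_ifs
    for dir in "$@"; do
        if [ -f "$dir/$prog" ] && [ -x "$dir/$prog" ]; then
            echo "$dir/$prog"
            return 0
        fi
    done
    return 1
}
```

Because the current working directory is searched first, a stale copy of myCode in the directory where yod is started would be chosen over a newer copy found through PATH.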

When yod is executed from a PBS job script, there are certain variables defined that are required by the runtime system. If you do something sneaky in your PBS job script like rsh to another service node and run yod there, be sure to set these environment variables in the new shell to the same value they have in the original shell: PBS_ENVIRONMENT, PBS_BATCH, PBS_JOBID, PBS_NNODES.
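
One way to carry these variables across an rsh is to turn them into assignments that the remote command line can pass to env. The sketch below is an assumption about how you might do this, not a Cplant facility: the function name pbs_env_assignments is hypothetical, and the quoting is only safe if the values contain no single quotes.

```shell
# Sketch only: prints NAME='value' pairs for the PBS variables listed above,
# using whatever values they have in the current shell.
pbs_env_assignments() {
    out=
    for name in PBS_ENVIRONMENT PBS_BATCH PBS_JOBID PBS_NNODES; do
        eval "value=\${$name-}"             # look up the variable by name
        out="$out${out:+ }$name='$value'"   # append, separated by spaces
    done
    echo "$out"
}
```

The output can then be placed in front of the remote yod command, where other-service-node is a placeholder host name:

rsh other-service-node "env $(pbs_env_assignments) yod -batch -sz 64 myCode"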

See Also


pingd, PCT, bebopd, cgdb
