yod -- Load a Cplant parallel application
yod {yod-options} program-path-name {program-arguments}
OR
yod {yod-options} load-file-name
YOD-OPTIONS:
[-alloc] [-attach] [-bt] [-D] [-d info-type] [-file file-name]
[-help | -vhelp] [-interactive | -batch] [-kill] [-list node-list]
[-Log] [-nid node-number -pid portal-id] [-NOBUF] [-quiet] [-show] [-sleep where]
[-strace path-name] [-straceoptions option-list] [-stracenodes rank-list]
[-sz nodes] [-timing]
yod is a utility that loads a parallel application onto a set of compute
nodes. File operations performed by the compute node processes (if not
directed to a parallel IO facility) are transparently forwarded to yod
which executes the operations and returns the results to the application.
yod exits when each member of the parallel application has exited.
Here is a typical use of yod. It loads myCode on 64 nodes, and passes the command
line argument -i input.dat to each process of the parallel program.
yod -sz 64 myCode -i input.dat
The program-arguments, along with your environment,
will be sent along to the compute node processes. The standard input
of yod is the standard input of the compute node processes. The standard
input is not duplicated, so if node 0 reads some bytes from standard input,
the next read of standard input from any node in the parallel application
will get the next bytes in the stream.
It is possible to send a SIGUSR1
or SIGUSR2 to a parallel application by sending the signal to yod. yod
will forward the signal to the user application processes. (Type kill -s
SIGUSR1 yod-pid on the node running yod to send the application processes
a SIGUSR1.)
Interrupting yod with control-c causes it to interrupt the
application processes with a SIGTERM. yod will await completion messages
from the compute nodes. If yod seems stuck, interrupt with control-c again.
This will cause yod to interrupt the application processes with a SIGKILL.
If yod still seems stuck, interrupt with control-c a third time. yod
will simply reset the compute nodes and exit.
An alternative to killing
a job through yod is to run pingd -reset -mine to reset the compute nodes
hosting your application. Your application processes will be sent a SIGKILL,
and the compute nodes released for other users. You may use the command
pingd -interrupt -mine to send a SIGTERM to all of your parallel applications.
See the pingd man page for other ways to specify nodes or jobs for the
command to act upon.
When loading a single executable file onto the compute
partition, list the executable path name followed by your program arguments
on the yod command line. To load more than one executable file, or to specify
different command line arguments for different processes (a heterogeneous
load), specify the command lines in process rank order in a load file. List
the load file name as the argument to yod.
Your load file is a text file you create with your favorite text editor.
It has two kinds of entries: comments and application members. Comments
are lines on which the first text that appears is a pound sign (#). These
are ignored by yod. The other type of entry lists a member of the parallel
application and has this format:
{yod-options} program-path-name {program-arguments}
The only yod options accepted in a load file are -sz and -list.
Example:
yod -l 100..200 myLoadFile
The contents of myLoadFile are listed here:
#
# load file to run my computation and parallel vis server
#
-sz 2 -l 500,501 my-vis-code bufsize=2048
-sz 64 my-computational-code
In this example, the executable file my-vis-code
will be loaded on nodes 500 and 501, will be passed the argument bufsize=2048,
and will be ranks 0 and 1 in the parallel application. The executable file
my-computational-code will be loaded on 64 free nodes found in the node number
list 100 through 200. These processes will have ranks 2 through 65 in the
parallel application. MPI users note that the 66 processes described will
populate a single MPI_COMM_WORLD on application start up.
If a load file
is provided, any size argument given on the yod command line is ignored.
If there is no node list given in the load file for a member, then the
node list given on the yod command line will be used. If in addition there
is no node list given on the yod command line, then the requested nodes
will be allocated from anywhere among the general collection of free nodes.
If there is no size argument provided in the load file, but a node list
is provided, it will be assumed that you want all the nodes in the node
list. If there is no size argument provided in the load file and also no
node list, it will be assumed that you want one node from anywhere.
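The rank numbering and size defaults described above can be sketched as a small shell helper. This is only an illustration, not part of Cplant: the function names count_nodes and load_file_ranks are hypothetical, the sketch assumes that -l is accepted as an abbreviation of -list (as the example load file suggests), and it assumes a well-formed load file. It reads a load file on standard input, takes the node list from the yod command line as its argument, and prints the rank range each member would occupy.

```shell
# Hypothetical helpers, not part of Cplant. Variables are global because
# plain POSIX sh has no "local".

# Count the nodes named by a node list such as 100..200 or 500,501.
count_nodes() {
    total=0
    old_ifs=$IFS
    IFS=,
    for spec in $1; do
        case $spec in
            # A range: two node numbers separated by one or more dots.
            *.*) total=$((total + ${spec##*.} - ${spec%%.*} + 1)) ;;
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}

# Usage: load_file_ranks COMMAND-LINE-NODE-LIST < load-file
# Prints "first-last program" for each application member.
load_file_ranks() {
    default_list=$1          # node list from the yod command line, may be empty
    next_rank=0
    while read -r line; do
        set -f               # split the line into words without globbing
        set -- $line
        set +f
        case $1 in
            ''|'#'*) continue ;;     # skip blank lines and comments
        esac
        size=
        list=$default_list
        while [ $# -gt 0 ]; do
            case $1 in
                -sz)      size=$2; shift 2 ;;
                -l|-list) list=$2; shift 2 ;;
                *)        break ;;
            esac
        done
        program=$1
        if [ -z "$size" ]; then
            # No size: all nodes in the list, or one node if there is no list.
            if [ -n "$list" ]; then
                size=$(count_nodes "$list")
            else
                size=1
            fi
        fi
        echo "$next_rank-$((next_rank + size - 1)) $program"
        next_rank=$((next_rank + size))
    done
}
```

Run against the example load file with the command-line node list 100..200 (load_file_ranks 100..200 < myLoadFile), the sketch reports ranks 0-1 for my-vis-code and ranks 2-65 for my-computational-code, matching the description above.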
- -alloc
- Choosing -alloc was useful for compute node debugging before
the availability of cgdb or Totalview.
It displays the nodes on which your application
has been started and waits for you to press a key before allowing the processes
in your parallel application to proceed out of system code and into user
code. You could at this point log in to a compute node and attach a debugger
to your application to catch it before it proceeds to main. Since users
are discouraged from logging into compute nodes, it would be better for
you to use -attach and cgdb. Also see the -bt option of yod.
- -attach
- This
option is essentially the same as -alloc. It is intended to hold the application
processes once they have started executing at an instruction prior to user
code (prior to main). You can at this point start cgdb to attach a debugger
to a process. See the cgdb man page for more help on debugging compute
node processes.
- -batch
- This option informs yod that it is not being
run interactively. In this case, yod will not wait for user responses
in certain circumstances. For example, if one of your application processes
terminates abnormally (with a non-zero exit code or as the result of a signal),
yod will automatically kill your parallel application for you. Normally
your application is not killed if some processes are still running. The
default is that you are not running in batch mode. See -interactive.
- -bt
- This option will cause yod to display a stack trace for user processes
that terminate abnormally. yod normally displays a one-line completion message
for each process in your parallel application, listing the exit code or
terminating signal if any. If the completion message indicates that your
application process terminated with a signal and you wish to investigate,
you may rebuild your application with debugging symbols and re-run it with
the -bt option of yod. The PCT will then attach a debugger to your process,
collect the stack trace when it faults, and send the stack trace to yod
for display.
- -D
- Turn on debugging of the application load. The steps
in the load protocol are displayed as the application load progresses.
Application process file IO requests are displayed as yod receives them.
- -file file-name
- When all processes in the parallel application have
completed, yod displays a one line completion message for each process.
This message lists the wall-clock time elapsed from start to finish for
the process, and the exit code and terminating signal, if any, for the
process. By default the listing goes to stdout, but may be redirected to
a file with this option.
- -help
- -vhelp
- The -help option displays a usage message
for yod; -vhelp displays a more verbose message.
- -interactive
- This option
informs yod that it is being run interactively by a living user. This is
the default mode. If yod is being run by a script, be certain to specify
-batch on the command line. One difference between interactive mode and
batch mode is that if the load fails on one node, interactive mode waits
for the user to interrupt yod with control-c before cancelling the load
on all allocated nodes. Batch mode goes ahead and cancels the load.
- -kill
- When yod is run in interactive mode (the default) and a process of
a parallel application terminates abnormally, yod displays the fact that
the process terminated but does not kill the other processes in the job.
The user may choose to abort the job by terminating yod with control-c.
If the user wishes yod to automatically kill the application when one
or more processes terminate abnormally, then use the -kill option to yod.
- -list node-list
- If a node-list is provided on the yod command line, then
the nodes requested will be allocated out of this list. If -sz n is specified
as well, then n nodes will be allocated out of the list. If there are
not n free nodes in the list, yod will display an error message.
If no -sz option is specified, yod will assume you want all the nodes
in the node-list. A node-list is a list of node specifiers separated by commas.
A node specifier is a physical node number or a node range. A node range
is specified by two physical node numbers separated by one or more dots.
No white space may be included in the node-list. Example: -l 25..35,112..140,160,165
- -Log
- This option causes the compute node application load protocol
steps to be logged to /var/log/cplant on the compute node. It is intended
for use by Cplant system debuggers.
- -NOBUF
- yod displays its own messages
and also text printed by the parallel application processes while they
are running. Normally this combination of buffered (yod's status messages)
and unbuffered (application output and yod's error messages) messages appears
sensibly on the tty that started yod. But if yod was started by an rsh
from a remote node, the output appears garbled. The -NOBUF option solves
this problem by making all yod output unbuffered.
- -nid node-number
- -pid portal-ID
- These arguments will cause yod to contact the bebopd on the specified
node number and at the specified portal ID rather than the bebopd listed
in the cplant-host file. This option is only for testing alternative
bebopds and should be used only by Cplant developers.
- -quiet
- yod, like this man page, is quite verbose. It lists many status
and error messages as it loads and runs a parallel application. If you
wish to have these messages suppressed, run yod with the -quiet option.
- -show
- Cplant parallel applications are encoded with a version string.
yod will not load an application encoded with the wrong version string
(unless you run yod with the secret -xxx option). The -show option lists
the correct version string and the version string found in your executable.
- -sleep where
- Cplant system debuggers may want to attach a debugger
to a Cplant application before it is in user code. This option provides
4 different points at which the processes can be held for 60 seconds.
The options are -sleep 1 (right after the fork), -sleep 2 (just before
the exec), -sleep 3 (right after entering system startup code), -sleep 4
(just before proceeding to main).
- -strace path-name
- Yet another debugging
tool. path-name should be a directory which is mounted writable on the compute
node. This option will cause the PCT to run the application process under
strace which will list all system calls (and their arguments) made by the
application process. By default, only the rank 0 process is traced. The
strace output goes to a file in directory path-name. The file name contains
the Cplant job ID and the rank of the process being traced.
- -straceoptions option-list
- The PCT will invoke strace with the options you specify
in the quoted string option-list. You must use the -strace option with
this option.
- -stracenodes rank-list
- The PCT will invoke strace on the
processes with the ranks given in the rank-list. The format for the rank-list
is the same as the format for a node list. By default, strace is invoked
only on the rank 0 process. You must use the -strace option with this option.
- -sz nodes
- The number of compute nodes required to run the parallel
application. One member (process) of the application will run on each node.
The default if no node list is specified is -sz 1. The default if a node
list is specified is the number of nodes in the node list.
- -timing
- Interested
in how long the different stages of application load are taking? The -timing
option times them and displays the results in seconds. (If our name was
mpirun instead of yod we would display it in minutes!)
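The node-list format described under -list (and reused for the rank-list of -stracenodes) can be illustrated with a short shell sketch that expands a list into individual numbers. The function name expand_node_list is hypothetical and is not part of Cplant; the sketch assumes a well-formed list with no white space and with the lower number first in each range.

```shell
# Hypothetical helper, not part of Cplant: print one node number per line
# for a node list such as 25..35,112..140,160,165.
expand_node_list() {
    old_ifs=$IFS
    IFS=,                       # node specifiers are separated by commas
    for spec in $1; do
        case $spec in
            *.*)
                # A range: two node numbers separated by one or more dots.
                first=${spec%%.*}
                last=${spec##*.}
                node=$first
                while [ "$node" -le "$last" ]; do
                    echo "$node"
                    node=$((node + 1))
                done
                ;;
            *)
                # A single physical node number.
                echo "$spec"
                ;;
        esac
    done
    IFS=$old_ifs
}
```

For the example list 25..35,112..140,160,165, piping the output through wc -l gives 42, which is the number of nodes yod would assume if no -sz option were given.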
yod returns 0 if the parallel application terminated normally, 1 if
the application ran and terminated abnormally, and 2 if the application
load failed and the application never started. Abnormal termination occurs
if one or more of the processes of the parallel application exited with a non-zero
exit code, or was terminated by a signal.
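A script that runs yod can branch on these return values. The following is a minimal sketch; the function name describe_yod_status is hypothetical, and the commented yod invocation simply reuses the earlier example command line with -batch added because a script is running it.

```shell
# Hypothetical helper, not part of Cplant: translate yod's return value
# into a short description.
describe_yod_status() {
    case $1 in
        0) echo "application terminated normally" ;;
        1) echo "application ran but terminated abnormally" ;;
        2) echo "load failed; application never started" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Typical use in a job script (requires a Cplant system):
#   yod -batch -sz 64 myCode -i input.dat
#   describe_yod_status $?
```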
Environment variables that affect yod's behavior are described here.
Occasionally a load will fail because a compute node allocated to your
parallel application is not working. yod will try to obtain a new set of
nodes and load again. It will try up to three times. If you want to decrease
or increase the number of retries, set the value of the environment variable
YODRETRYCOUNT
to the number of times yod should retry the load.
If you do not specify the full path name of the executable, yod will search first
for the executable in the current working directory. If it is not found,
yod will use the PATH variable in your environment to search for the
executable.
When yod is executed from a PBS job script, there are certain
variables defined that are required by the runtime system. If you do something
sneaky in your PBS job script like rsh to another service node and run
yod there, be sure to set these environment variables in the new shell
to the same value they have in the original shell: PBS_ENVIRONMENT, PBS_BATCH,
PBS_JOBID, PBS_NNODES.
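Before running yod in such a new shell, a script could check that the four variables are present. The sketch below is only an illustration; the function name missing_pbs_vars is hypothetical, and it tests only that each variable is non-empty, not that its value matches the original shell.

```shell
# Hypothetical helper, not part of Cplant: print the name of each required
# PBS variable that is unset or empty in the current shell.
missing_pbs_vars() {
    for name in PBS_ENVIRONMENT PBS_BATCH PBS_JOBID PBS_NNODES; do
        # Indirect lookup of the variable called $name; the names are fixed
        # above, so the eval does not process untrusted input.
        eval "value=\${$name-}"
        [ -n "$value" ] || echo "$name"
    done
}

# Typical use before invoking yod:
#   missing=$(missing_pbs_vars)
#   [ -z "$missing" ] || echo "unset PBS variables: $missing" >&2
```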
See also: pingd, PCT, bebopd, cgdb