pingd -- The all-purpose interface to the node allocator and compute node
PCTs. It displays the status of a running parallel application, and can
interrupt or kill your parallel application. For administrators only, it
also reserves nodes, kills PCTs, notifies the bebopd of the death of a PCT,
and changes the bebopd's level of PBS support.
pingd [-fast|-interrupt|-kill|-reset|-gone|-reserve user-id|-unreserve]
[-job job-ID|-pbsjob job-ID|-mine|-list node-list|-all]
[-verbose] [-help|-xtrahelp]
[-parse]
[-nid node-number -pid portal-id]
[-NoInquire] [-summary]
[-PBSsupport [ on | off ] ] [-PBSupdate [ on | off ] ] [ -PBSinteractive [ n ] ]
By default, pingd (ping the daemon) contacts a bebopd (the Cplant node
allocator daemon) and obtains and displays information about the status
of compute nodes and the jobs they are running. System administrators
can use the kill function of pingd to kill PCTs (the Cplant compute node
daemons), or the gone function to notify the bebopd that a PCT is dead.
The reset function allows users to kill their jobs (with a SIGKILL to every
compute node process) and reset the compute nodes hosting their jobs to
FREE status. The interrupt function allows users to interrupt their jobs
with a SIGTERM to every compute node process. System administrators can
reset or interrupt any compute node.
- -all
- Perform requested operation (like query, reset, interrupt)
on all nodes in the compute partition. This is the default. To limit the
operation, use -list, -job, -pbsjob or -mine.
- -fast
- By default, the bebopd
queries all compute nodes for their status before reporting back to pingd
(unless it queried them all very recently). For a faster display, -fast
queries the bebopd for its most recent update from the compute partition.
- -gone
- A PCT can terminate without notifying the
bebopd. Use -gone to notify the bebopd that a PCT has disappeared from
a node.
- -help
- -xtrahelp
- Display a list of pingd options and how to use
them. -xtrahelp provides a more verbose message.
- -interrupt
- Send a SIGTERM
to the parallel application on each specified node. System administrators
can interrupt any application. Users can only interrupt their own jobs.
- -job job-ID
- -pbsjob job-ID
- Limit the function to compute nodes running
job number job-ID. A PBS (Portable Batch System) job may start several Cplant
parallel applications. The job as a whole has a PBS job ID, and each application
has a Cplant job ID. Use the -pbsjob option to specify a PBS job ID, and
use the -job option to specify a single Cplant parallel application.
- -kill
- System administrators can kill PCTs with this option.
- -list node-list
- Perform the requested operation on the specified list of nodes. Node
specifiers are delimited by commas. A node specifier is a physical node
number or a node range. A node range is specified by two physical node
numbers separated by one or more dots. No white space may be included in
the node-list. The node-list may also be given without the -list
option specifier: if pingd finds something on its argument line without
an option specifier that can be parsed as a node-list, it assumes it is
a node-list.
- -mine
- Perform the function only on compute nodes running my
jobs.
- -nid node-number
- -pid portal-ID
- These arguments will cause pingd
to contact the bebopd on the specified node number and at the specified
portal ID rather than the bebopd listed in the cplant-host file. This
option is only for testing alternative bebopds and should probably be used
only by Cplant developers.
- -NoInquire
- Normally pingd displays an "are
you sure?" prompt before interrupting, resetting or killing nodes.
Use the -NoInquire option to make pingd skip this step.
- -parse
- This
option causes pingd to list its output in an easily parseable format.
- -PBSsupport [on|off]
- -PBSupdate [on|off]
- PBS (Portable Batch System) can rely on the Cplant
bebopd node allocator to tell it how many live compute nodes are in the
machine. This number may change if nodes crash or if non-PBS jobs complete
sometime after PBS has started managing the machine. The bebopd is running
in PBSsupport mode if it is keeping track of the number of live compute
nodes in the machine and policing PBS users to ensure they use no more
nodes than they were allocated. The bebopd is running in PBSupdate mode
if in addition it sends updates to the PBS server whenever the number of
live compute nodes changes. These two arguments can be used to turn on
or off PBSsupport and to turn on or off PBSupdate. Since PBSupdate implies
PBSsupport, turning on PBSupdate automatically turns on PBSsupport, and
turning off PBSsupport automatically turns off PBSupdate.
- -PBSinteractive n
-
The bebopd can reserve n nodes for interactive use. PBS will not be able
to schedule these nodes for batch jobs. This option sends a request to
the bebopd to reserve n nodes for interactive use.
- -reserve user-id
-
System administrators can reserve a node for a particular user with this
option. The argument is either a user name or numeric user ID.
The bebopd will allow a job running on the node to complete, but
will refuse to allocate the node to anyone other than the specified user.
To free the node, use the -unreserve option. This option should
only be used to debug troubled nodes. Taking nodes away can cause
jobs to fail when jobs are being scheduled by PBS.
- -reset
- Reset the selected nodes. This option kills the application process
(with SIGKILL), and resets the PCT to available status. System administrators
can reset any node. Users can only reset nodes running their jobs.
- -summary
- Rather than displaying a line per node, just display the totals.
- -unreserve
-
Use this option to free a node that has been reserved for a particular
user. A job running on the node will not be disturbed.
- -verbose
- Display extra information
about running jobs. You will not see this information about other people's
jobs unless you are a very special user.
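The node-list syntax accepted by -list (comma-delimited specifiers, each a physical node number or two node numbers separated by one or more dots, with no white space) can be illustrated with a short sketch. This is not pingd's own parser: the function name parse_node_list is hypothetical, and rejecting descending ranges and returning a sorted list without duplicates are assumptions made for the illustration.

```python
import re

# One node specifier: a node number, optionally followed by one or more
# dots and a second node number (a node range).
_SPEC = re.compile(r"([0-9]+)(?:\.+([0-9]+))?")


def parse_node_list(text):
    """Expand a node-list such as '55..57,61' into a sorted list of node numbers."""
    if not text or any(ch.isspace() for ch in text):
        raise ValueError("node-list must be non-empty and contain no white space")
    nodes = set()
    for spec in text.split(","):
        match = _SPEC.fullmatch(spec)
        if match is None:
            raise ValueError(f"bad node specifier: {spec!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) is not None else first
        if last < first:
            # Assumption: a descending range is treated as an error.
            raise ValueError(f"descending node range: {spec!r}")
        nodes.update(range(first, last + 1))  # ranges are inclusive
    return sorted(nodes)


print(parse_node_list("55..57,61"))
```

The same test (does the argument parse as a node-list?) is the kind of check that would let a bare argument such as 0..100 be accepted without the -list specifier.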
pingd assumes the most restrictive interpretation of which nodes are
specified. If you provide a list of node numbers and a job ID, and also
specify -mine, pingd will perform the operation on the nodes in the list
which are running the job specified, if you own it.
To list the current status of all nodes in the compute partition:
pingd
To list the status most recently reported to the service partition of
all nodes in the compute partition (without going out and querying the
compute partition):
pingd -fast
To kill the PCT on node 20:
pingd -kill -l 20
To reset the PCTs on nodes 0 through 100 which are running my jobs, either
of these will work. (The -l option specifier may be omitted when specifying
a node list.)
pingd -reset -l 0..100 -mine
pingd -reset 0..100 -mine
To display detailed status information about my jobs:
pingd -m -v
To inform the bebopd that the PCTs on nodes 55, 56, 57 and 61 are dead:
pingd -gone -l 55..57,61
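The rule that pingd takes the most restrictive interpretation of the node selectors (a node list, a job ID and -mine given together) amounts to intersecting the sets of nodes that each selector names. The sketch below illustrates that idea only. The function select_nodes, its parameters, and the status dictionary layout are all hypothetical and do not reflect pingd's or the bebopd's actual data structures.

```python
def select_nodes(status, node_list=None, job_id=None, mine_user=None):
    """Return the nodes matched by every selector that was given.

    status maps a node number to None (no job) or to a dict with the
    hypothetical keys 'job_id' and 'owner'.
    """
    selected = set(status)  # no selectors: every node, like -all
    if node_list is not None:
        selected &= set(node_list)
    if job_id is not None:
        selected &= {n for n, job in status.items()
                     if job is not None and job["job_id"] == job_id}
    if mine_user is not None:
        selected &= {n for n, job in status.items()
                     if job is not None and job["owner"] == mine_user}
    return sorted(selected)


# Made-up partition state for illustration.
status = {
    0: None,
    1: {"job_id": 7, "owner": "alice"},
    2: {"job_id": 7, "owner": "alice"},
    3: {"job_id": 9, "owner": "bob"},
}
print(select_nodes(status, node_list=[0, 1, 3], job_id=7, mine_user="alice"))
```

With this made-up state, only node 1 is in the list, runs job 7, and belongs to alice, so it is the only node an operation would touch.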
Some compute nodes may be slow to respond to your request, and pingd
does not wait for them. This is not an error. Run pingd again with the
-fast option to get the updates which arrived at the bebopd (service node
daemon) after your pingd display. (Running without -fast would cause the
bebopd to go out and query all the compute nodes again.)
- /cplant/cplant-host
- This file identifies the location of a bebopd
daemon.
- /var/log/cplant
- This is the log file where Cplant daemons and
utilities log status.
bebopd
Let us know if you locate any bugs (cplant-help@cs.sandia.gov).