
Name


pingd -- The all-purpose interface to the node allocator and compute node PCTs. It displays the status of a running parallel application, and interrupts or kills your parallel application. For administrators only, it also reserves nodes, kills PCTs, notifies the bebopd that a PCT has died, and changes the bebopd's level of PBS support.

Synopsis


pingd [-fast|-interrupt|-kill|-reset|-gone|-reserve user-id|-unreserve]

[-job job-ID|-pbsjob job-ID|-mine|-list node-list|-all]

[-verbose] [-help|-xtrahelp]

[-parse]

[-nid node-number -pid portal-ID]

[-NoInquire] [-summary]

[-PBSsupport [ on | off ] ] [-PBSupdate [ on | off ] ] [ -PBSinteractive [ n ] ]

Description


By default, pingd (ping the daemon) contacts a bebopd (the Cplant node allocator daemon) and obtains and displays information about the status of compute nodes and the jobs they are running. System administrators can use the kill function of pingd to kill PCTs (the Cplant compute node daemons), or the gone function to notify the bebopd that a PCT is dead. The reset function allows users to kill their jobs (with a SIGKILL to every compute node process) and reset the compute nodes hosting their jobs to FREE status. The interrupt function allows users to interrupt their jobs with a SIGTERM to every compute node process. System administrators can reset or interrupt any compute node.

Options


-all
Perform requested operation (like query, reset, interrupt) on all nodes in the compute partition. This is the default. To limit the operation, use -list, -job, -pbsjob or -mine.

-fast
By default, the bebopd queries all compute nodes for their status before reporting back to pingd (unless it queried them all very recently). For a faster display, -fast queries the bebopd for its most recent update from the compute partition.

-gone
A PCT can terminate without notifying the bebopd. Use -gone to notify the bebopd that a PCT has disappeared from a node.

-help
-xtrahelp
Display a list of pingd options and how to use them. -xtrahelp provides a more verbose message.

-interrupt
Send a SIGTERM to the parallel application on each specified node. System administrators can interrupt any application. Users can only interrupt their own jobs.

-job job-ID
-pbsjob job-ID
Limit the function to compute nodes running job number job-ID. A PBS (Portable Batch System) job may start several Cplant parallel applications. The job as a whole has a PBS job ID, and each application has a Cplant job ID. Use the -pbsjob option to specify a PBS job ID, and use the -job option to specify a single Cplant parallel application.

-kill
System administrators can kill PCTs with this option.

-list node-list
Perform the requested operation on the specified list of nodes. Node specifiers are delimited by commas. A node specifier is a physical node number or a node range. A node range is specified by two physical node numbers separated by one or more dots. No white space may be included in the node-list. The -list option specifier itself is optional: if pingd finds something on its argument line without an option specifier that can be parsed as a node-list, it will assume it's a node-list.
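
The node-list grammar described above can be sketched in a few lines of Python. This is an illustration only, not pingd's actual parser: the function name parse_node_list is hypothetical, ranges are taken to be inclusive (consistent with the 55..57,61 example later in this page), and rejecting descending ranges and merging duplicate nodes are assumptions.

```python
import re


def parse_node_list(node_list):
    """Expand a pingd-style node-list string into a sorted list of node numbers."""
    if not node_list or any(ch.isspace() for ch in node_list):
        raise ValueError("node-list must be non-empty and contain no white space")
    nodes = set()
    for spec in node_list.split(","):
        # A specifier is a single node number, or two node numbers
        # separated by one or more dots (a range).
        match = re.fullmatch(r"(\d+)(?:\.+(\d+))?", spec)
        if match is None:
            raise ValueError(f"bad node specifier: {spec!r}")
        first = int(match.group(1))
        last = int(match.group(2)) if match.group(2) is not None else first
        if last < first:
            # Assumption: descending ranges are treated as errors.
            raise ValueError(f"descending range: {spec!r}")
        nodes.update(range(first, last + 1))
    return sorted(nodes)


print(parse_node_list("55..57,61"))  # [55, 56, 57, 61]
```

A script could use such a helper to check a node-list before passing it to pingd, for example to catch stray white space early.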

-mine
Perform the function only on compute nodes running your jobs.

-nid node-number
-pid portal-ID
These arguments cause pingd to contact the bebopd on the specified node number and at the specified portal ID rather than the bebopd listed in the cplant-host file. This option is only for testing alternative bebopds and should probably be used only by Cplant developers.

-NoInquire
Normally pingd displays an "are you sure?" prompt before interrupting, resetting or killing nodes. Use the -NoInquire option to make pingd skip this step.

-parse
This option causes pingd to list its output in an easily parseable format.

-PBSsupport [on|off]
-PBSupdate [on|off]
PBS (Portable Batch System) can rely on the Cplant bebopd node allocator to tell it how many live compute nodes are in the machine. This number may change if nodes crash or if non-PBS jobs complete sometime after PBS has started managing the machine. The bebopd is running in PBSsupport mode if it is keeping track of the number of live compute nodes in the machine and policing PBS users to ensure they use no more nodes than they were allocated. The bebopd is running in PBSupdate mode if in addition it sends updates to the PBS server whenever the number of live compute nodes changes. These two arguments can be used to turn on or off PBSsupport and to turn on or off PBSupdate. Since PBSupdate implies PBSsupport, turning on PBSupdate automatically turns on PBSsupport, and turning off PBSsupport automatically turns off PBSupdate.
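
The coupling between the two modes can be modelled as a tiny state object. The sketch below is only an illustration of the rule stated above; the class PbsModes and its methods are hypothetical and do not correspond to anything in the bebopd source. It also assumes that turning PBSupdate off leaves PBSsupport unchanged, which the text does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class PbsModes:
    """Hypothetical model of the bebopd's PBSsupport/PBSupdate switches."""
    support: bool = False
    update: bool = False

    def set_support(self, on):
        self.support = on
        if not on:
            # PBSupdate cannot remain on without PBSsupport.
            self.update = False

    def set_update(self, on):
        self.update = on
        if on:
            # PBSupdate implies PBSsupport.
            self.support = True


modes = PbsModes()
modes.set_update(True)
print(modes)  # PbsModes(support=True, update=True)
modes.set_support(False)
print(modes)  # PbsModes(support=False, update=False)
```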

-PBSinteractive n
The bebopd can reserve n nodes for interactive use. PBS will not be able to schedule these nodes for batch jobs. This option sends a request to the bebopd to reserve n nodes for interactive use.

-reserve user-id
System administrators can reserve a node for a particular user with this option. The argument is either a user name or numeric user ID. The bebopd will allow a job running on the node to complete, but will refuse to allocate the node to anyone other than the specified user. To free the node, use the -unreserve option. This option should only be used to debug troubled nodes. Taking nodes away can cause jobs to fail when jobs are being scheduled by PBS.

-reset
Reset the selected nodes. This option kills the application process (with SIGKILL), and resets the PCT to available status. System administrators can reset any node. Users can only reset nodes running their jobs.

-summary
Rather than displaying a line per node, just display the totals.

-unreserve
Use this option to free a node that has been reserved for a particular user. A job running on the node will not be disturbed.

-verbose
Display extra information about running jobs. You will not see this information about other people's jobs unless you are a very special user.

Examples


pingd assumes the most restrictive interpretation of which nodes are specified. If you provide a list of node numbers, a job ID, and specify -mine, pingd will perform the operation on the nodes in the list which are running the job specified, if you own it.

To list the current status of all nodes in the compute partition:


    pingd 
    

To list the status most recently reported to the service partition of all nodes in the compute partition (without going out and querying the compute partition):


    pingd -fast
    

To kill the PCT on node 20:


    pingd -kill -l 20
    

To reset the PCTs on nodes 0 through 100 which are running my jobs, either of these will work. (The -l option specifier may be omitted when specifying a node list.)


    pingd -reset -l 0..100 -mine
    pingd -reset 0..100 -mine
    

To display detailed status information about my jobs:


    pingd -m -v
    

To inform the bebopd that the PCTs on nodes 55, 56, 57 and 61 are dead:


    pingd -gone -l 55..57,61
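
The most restrictive interpretation described at the start of this section amounts to intersecting every filter that was supplied. The Python sketch below illustrates that idea under stated assumptions: the function select_nodes, the dictionary layout, and the job and owner values are all hypothetical and are not taken from pingd or the bebopd.

```python
def select_nodes(nodes, node_list=None, job_id=None, owner=None):
    """Return the node numbers that satisfy every filter that was supplied."""
    selected = []
    for number, info in sorted(nodes.items()):
        if node_list is not None and number not in node_list:
            continue  # excluded by the node-list filter
        if job_id is not None and info.get("job") != job_id:
            continue  # excluded by the job filter
        if owner is not None and info.get("owner") != owner:
            continue  # excluded by the ownership filter (like -mine)
        selected.append(number)
    return selected


# Hypothetical partition state: node number -> job ID and owner.
partition = {
    3: {"job": 41, "owner": "alice"},
    4: {"job": 41, "owner": "alice"},
    5: {"job": 42, "owner": "bob"},
    9: {"job": 41, "owner": "alice"},
}

# Nodes 0 through 5, running job 41, owned by alice.
print(select_nodes(partition, node_list=range(0, 6), job_id=41, owner="alice"))  # [3, 4]
```

With no filters at all the sketch selects every node, which mirrors -all being the default.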
    

Errors


Some compute nodes may be slow to respond to your request, and pingd does not wait for them. This is not an error. Run pingd again with the -fast option to get the updates which arrived at the bebopd (service node daemon) after your pingd display. (Running without -fast would cause the bebopd to go out and query all the compute nodes again.)

Files


/cplant/cplant-host
This file identifies the location of a bebopd daemon.

/var/log/cplant
This is the log file where Cplant daemons and utilities log status.

See Also


bebopd

Bugs


Let us know if you locate any (cplant-help@cs.sandia.gov).

