
Count of assigned nodes

The PBS server also maintains a count of the number of nodes it has assigned to jobs. This is the value of the server's resources_assigned.size attribute. There are two ways this can get out of balance.

First, the PBS command qstat -a prints a total of assigned nodes at the end of its display. This total should match the server's resources_assigned.size value.

command>> qstat -a

myri-0.n-4.r-3: 
                                           Time In Req'd  Req'd   Elap
Job ID Username Queue    Jobname    SessID  Queue  Nodes  Time  S Time
------ -------- -------- ---------- ------ ------- ------ ----- - -----
64     jrobbin  dedicate fine_rb3      --   117:32    700 100:0 Q   -- 
137    gsgrest  default  rrr10         641  072:35      8 96:00 R 72:35
139    gsgrest  default  runbead2      736  072:31    150 100:0 R 72:31
140    gsgrest  default  runpe-400     681  072:29     64 100:0 R 72:29
176    mechand  default  C6.33sh1     1985  064:17    128 40:00 R 64:17
200    swsides  default  ea          30304  007:23     16 40:00 R 07:13

Total compute nodes allocated: 366


command>>  qmgr -c "l s resources_assigned.size"
Server myri-0.n-4.r-3
        resources_assigned.size = 366

Here, the total number of nodes assigned to running jobs is equal to the PBS server's resources_assigned.size.

If they are not equal, you should restart the PBS server. This usually fixes the imbalance.
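The comparison can be automated. The following is a minimal sketch, not part of PBS or Cplant: it assumes the qstat -a and qmgr output formats are exactly those shown in the listings above, and the function names summarize_qstat and assigned_size are hypothetical. It sums the Req'd Nodes column over jobs in state R and reads the reported total and the server attribute.

```python
import re

# Sample text copied from the listings above.
SAMPLE_QSTAT = """\
myri-0.n-4.r-3:
                                           Time In Req'd  Req'd   Elap
Job ID Username Queue    Jobname    SessID  Queue  Nodes  Time  S Time
------ -------- -------- ---------- ------ ------- ------ ----- - -----
64     jrobbin  dedicate fine_rb3      --   117:32    700 100:0 Q   --
137    gsgrest  default  rrr10         641  072:35      8 96:00 R 72:35
139    gsgrest  default  runbead2      736  072:31    150 100:0 R 72:31
140    gsgrest  default  runpe-400     681  072:29     64 100:0 R 72:29
176    mechand  default  C6.33sh1     1985  064:17    128 40:00 R 64:17
200    swsides  default  ea          30304  007:23     16 40:00 R 07:13

Total compute nodes allocated: 366
"""

SAMPLE_QMGR = """\
Server myri-0.n-4.r-3
        resources_assigned.size = 366
"""


def summarize_qstat(qstat_text):
    """Return (nodes summed over running jobs, total reported by qstat or None)."""
    running_nodes = 0
    reported_total = None
    for line in qstat_text.splitlines():
        m = re.match(r"\s*Total compute nodes allocated:\s*(\d+)", line)
        if m:
            reported_total = int(m.group(1))
            continue
        fields = line.split()
        # Assumption: a job row has 10 columns, starts with a numeric job ID,
        # and carries the node count in column 7 and the state in column 9.
        if len(fields) == 10 and fields[0].isdigit() and fields[8] == "R":
            running_nodes += int(fields[6])
    return running_nodes, reported_total


def assigned_size(qmgr_text):
    """Return the resources_assigned.size value from qmgr output, or None."""
    m = re.search(r"resources_assigned\.size\s*=\s*(\d+)", qmgr_text)
    return int(m.group(1)) if m else None


if __name__ == "__main__":
    running, reported = summarize_qstat(SAMPLE_QSTAT)
    server = assigned_size(SAMPLE_QMGR)
    if running == reported == server:
        print("node counts agree:", server)
    else:
        print("imbalance: qstat rows", running, "qstat total", reported,
              "server", server)
```

For the sample listing, the running jobs request 8 + 150 + 64 + 128 + 16 = 366 nodes; the queued job (state Q, 700 nodes) is not counted.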

Secondly, the PBS server's count of assigned nodes should never be less than the number of nodes actually in use by PBS jobs. The PBS jobs may be using fewer nodes than they have been assigned, but they should never be using more. You can determine how many nodes are actually in use with pingd.

command>> pingd -s
Awaiting status from bebopd...
Awaiting pct list from bebopd

Total: 1018
Total busy: 411
Total free: 605
Total not responding to ping (try again): 2

Compute nodes are being scheduled by PBS, but 100
nodes are currently reserved for non-PBS interactive use.

Nodes currently hosting PBS jobs:         366
Nodes currently hosting interactive jobs: 45

Free nodes remaining for interactive jobs: 55

pingd shows that 366 nodes are in use by PBS jobs. This is correct. The "Nodes currently hosting PBS jobs" count must be no greater than the PBS server's resources_assigned.size.
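This second check can also be scripted. The sketch below is illustrative only: it assumes pingd -s prints the summary lines exactly as in the listing above, and the names pingd_counts and pbs_usage_ok are hypothetical helpers, not part of Cplant.

```python
import re

# Summary lines copied from the pingd -s listing above.
SAMPLE_PINGD = """\
Total: 1018
Total busy: 411
Total free: 605
Total not responding to ping (try again): 2

Nodes currently hosting PBS jobs:         366
Nodes currently hosting interactive jobs: 45
"""

# Assumption: these labels match the pingd output verbatim.
PATTERNS = {
    "total": r"^Total:\s*(\d+)",
    "busy": r"^Total busy:\s*(\d+)",
    "free": r"^Total free:\s*(\d+)",
    "pbs": r"^Nodes currently hosting PBS jobs:\s*(\d+)",
    "interactive": r"^Nodes currently hosting interactive jobs:\s*(\d+)",
}


def pingd_counts(pingd_text):
    """Extract the node counts found in pingd -s output into a dict."""
    counts = {}
    for key, pattern in PATTERNS.items():
        m = re.search(pattern, pingd_text, re.MULTILINE)
        if m:
            counts[key] = int(m.group(1))
    return counts


def pbs_usage_ok(pbs_nodes_in_use, server_assigned):
    """True when PBS jobs use no more nodes than the server has assigned."""
    return pbs_nodes_in_use <= server_assigned


if __name__ == "__main__":
    counts = pingd_counts(SAMPLE_PINGD)
    # 366 is the resources_assigned.size value from the qmgr listing.
    print("PBS usage within assignment:", pbs_usage_ok(counts["pbs"], 366))
```

In the sample, the 411 busy nodes are the 366 hosting PBS jobs plus the 45 hosting interactive jobs, and 366 does not exceed the server's count of 366, so the check passes.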

If there are more nodes in actual use than the PBS server has assigned, then there is a PBS job running that the PBS server believes has exited. (This may happen when the network fails while the MOM is trying to kill the job: the MOM notifies the PBS server that the job has been killed when in fact it has not.) To determine which job or jobs this is, run pingd and match the PBS job IDs displayed there with the running jobs listed in qstat. You should find one or more jobs listed by pingd that are not in qstat. Remove each such job with pingd.
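The matching step is a set difference. The sketch below assumes the job IDs have already been collected from the pingd and qstat displays; how pingd lists job IDs is not shown here, so that extraction is left out. The function name jobs_unknown_to_server and the job ID 150 are hypothetical.

```python
def jobs_unknown_to_server(pingd_job_ids, qstat_job_ids):
    """Return job IDs that pingd reports on nodes but qstat does not list."""
    return sorted(set(pingd_job_ids) - set(qstat_job_ids))


if __name__ == "__main__":
    # Illustrative input: job 150 is invented for this example.
    on_nodes = {"137", "139", "140", "150"}
    known_to_pbs = {"64", "137", "139", "140"}
    for job_id in jobs_unknown_to_server(on_nodes, known_to_pbs):
        print("running on nodes but unknown to the PBS server:", job_id)
```

A job that qstat lists but pingd does not (such as the queued job 64 in this input) is not flagged, since a queued job holds no nodes.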

A script called check_for_invalid_pbs_jobs, found in the Cplant source tree under support/cplant/sbin, is useful here.


Lee Ann Fisk 2001-06-25