
Nodes are free but scheduler won't run any jobs

Another problem that occurs at times is that PBS jobs waiting for nodes are not scheduled to run, even though enough nodes are free to run them.

First verify that the system is not being drained so that a starving job can run. qstat -s will display a message under a job if it is being held to free up nodes for a starving job. (A starving job is a job that has been waiting for more than 24 hours. PBS will refuse to run any jobs until it has sufficient nodes to run the starving job.)

Next, force a scheduling cycle with a qmgr command:

command>> qmgr
Max open servers: 4
Qmgr: set server scheduling=true
Qmgr: quit

If qstat reveals that the jobs are still not being scheduled to run, then look at the scheduler's log file (in /tmp/pbs/working/sched_logs) for the scheduler's explanation of why it is not running the job.
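
The log search can be sketched as a small shell function. This is only a sketch under assumptions: the function name sched_reasons is hypothetical, it assumes the scheduler mentions the job id on the relevant log lines, and the 20-line limit is an arbitrary choice. The default directory is the one named above.

```shell
# sched_reasons JOBID [LOGDIR]
# Print the last 20 scheduler log lines that mention JOBID.
# (Hypothetical helper; LOGDIR defaults to the directory named in the text.)
sched_reasons() {
    jobid=$1
    logdir=${2:-/tmp/pbs/working/sched_logs}
    # -h drops file names from the output; -F treats the job id as a
    # literal string rather than a regular expression.
    grep -hF -- "$jobid" "$logdir"/* 2>/dev/null | tail -n 20
}
```

It would be called with the job id that qstat reports, for example sched_reasons 1234, and prints nothing if no log line mentions that job.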

It may also help to check the server's totals, since the scheduler will not try to run jobs unless resources_available.size minus resources_assigned.size is large enough to run at least one of the jobs.

command>> qmgr -c "list server resources_available.size"

Server myri-0.n-4.r-3
        resources_available.size = 918

command>> qmgr -c "list server resources_assigned.size"

Server myri-0.n-4.r-3
        resources_assigned.size = 603

If resources_available.size is too low, force an update with pingd -PBSupdate on. If resources_assigned.size is too high, restart the server so that it will read its database and recalculate this value.
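
The comparison of the two totals can be scripted. The following is a minimal sketch, assuming the qmgr output has the "name = value" layout shown above; the helper names size_value and free_size are hypothetical and not part of PBS.

```shell
# size_value ATTR
# Read `qmgr -c "list server ..."` output on stdin and print the value
# from the first "ATTR = value" line. (Hypothetical helper.)
size_value() {
    awk -v attr="$1" 'index($0, attr) && $(NF-1) == "=" { print $NF; exit }'
}

# free_size AVAILABLE ASSIGNED
# Print what the scheduler has left to hand out.
free_size() {
    echo $(( $1 - $2 ))
}

# Query the live server only when qmgr is installed and both values parse.
if command -v qmgr >/dev/null 2>&1; then
    avail=$(qmgr -c "list server resources_available.size" | size_value resources_available.size)
    assigned=$(qmgr -c "list server resources_assigned.size" | size_value resources_assigned.size)
    if [ -n "$avail" ] && [ -n "$assigned" ]; then
        echo "free size: $(free_size "$avail" "$assigned")"
    fi
fi
```

With the values shown above, free_size 918 603 prints 315, so a waiting job that needs more than 315 would not be started.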

If the scheduler is in fact scheduling jobs to run, look in the server's log file. The server may indicate that the MOMs are refusing to run the jobs. In this case, try restarting the MOMs, and then restarting the server. If it is only one MOM that is refusing to run jobs, then take that MOM offline with a qmgr command.

command>> qmgr
Max open servers: 4
Qmgr: set node myri-0.n-2.r-3 state=offline
Qmgr: quit

command>> pbsnodes -a

myri-0.n-0.r-3
     state = free
     np = 1
     ntype = time-shared

myri-0.n-2.r-3
     state = offline
     np = 1
     ntype = time-shared
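
To confirm which nodes are offline without reading the whole listing, the pbsnodes output can be filtered. This is a sketch that assumes the layout shown above: each node name stands alone on a line, followed by "name = value" attribute lines. The function name offline_nodes is hypothetical.

```shell
# offline_nodes
# Read `pbsnodes -a` output on stdin and print the name of every node
# whose state contains "offline". (Hypothetical helper.)
offline_nodes() {
    awk '
        NF == 1 { node = $1 }                            # a lone word is a node name
        $1 == "state" && $3 ~ /offline/ { print node }   # "state = ..." attribute line
    '
}
```

It would be used as pbsnodes -a | offline_nodes, which for the listing above prints myri-0.n-2.r-3.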


Lee Ann Fisk 2001-06-25