
Can't delete a job that has completed

Occasionally qstat lists a job as running when it has actually completed, and attempts to delete the job from the PBS system with qdel fail. This can happen when the MOM catches the termination of the job but the network fails while the MOM is sending the obituary to the server. When you run qdel, the PBS server sends a command to the MOM to delete the job, but the MOM refuses because it believes it has already done so. As a result, the PBS server refuses to delete the job from its records.

This situation can also arise when the MOM on the node that had been hosting the job is no longer running, or when the MOM's node is unreachable. PBS can't delete the job because it can't reach the MOM.

To fix this situation, remove all files and directories related to the job that were maintained by the PBS server and the MOM daemon. If you find the job's stdout and stderr streams, it is polite to send them to the user. Then restart the PBS server.

Suppose, for example, that PBS job number 64 has completed but qstat shows it as still running. Follow these steps to remove the job:

On the MOM's node (the service node on which the job ran):

command>> cd /tmp/pbs/working/undelivered   
          cd /tmp/pbs/working/spool

command>> ls
275.myri-.ER  323.myri-.OU  503.myri-.OU    64.myri-15.ER
275.myri-.OU  547.myri-.ER  1180.myri-1.ER  64.myri-15.OU
323.myri-.ER  547.myri-.OU  1180.myri-1.OU

command>> mail JoeUser -s "your pbs output" < 64.myri-15.ER
command>> mail JoeUser -s "your pbs output" < 64.myri-15.OU
command>> rm 64.*

command>> cd /tmp/pbs/working/mom_priv/jobs
command>> ls
820.myri-.JB   881.myri-.SC*  64.myri-.TK/
820.myri-.SC*  881.myri-.TK/  820.myri-.TK/
881.myri-.JB   64.myri-.SC*   64.myri-.JB

command>> rm -rf 64.*

On the server's node:

command>> cd /tmp/pbs/working/server_priv/jobs

command>> ls -l
total 92
-rw-------    1 root     root         4577 Jun  1 09:18 820.myri-0..JB
-rw-------    1 root     root          284 Jun  1 09:18 820.myri-0..SC
-rw-------    1 root     root         4598 Jun  1 09:22 881.myri-0..JB
-rw-------    1 root     root          182 Jun  1 09:22 881.myri-0..SC
-rw-------    1 root     root         3911 May 30 12:21 64.myri-0.n.JB
-rw-------    1 root     root          462 May 30 12:21 64.myri-0.n.SC

command>> rm 64.*

command>> /cplant/etc/pbs-env restart server

-->> Executing /cplant/etc/pbs-env  restart server
restarting pbs_server, please wait
-->> Done with /cplant/etc/pbs-env restart
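
Because the rm commands above act on a glob, it can be worth listing exactly what 64.* matches in each directory before deleting anything. The following is a minimal sketch of such a check. It is not part of PBS or the Cplant tools: the function name list_job_files is hypothetical, and it only prints names, it never removes files.

```shell
# list_job_files JOBNUM DIR   (hypothetical helper, not a PBS command)
# Print the entries in DIR whose names begin with "JOBNUM.", one per line.
# Nothing is deleted; this is a dry run for the rm steps.
list_job_files() {
    jobnum=$1
    dir=$2
    for path in "$dir"/"$jobnum".*; do
        # If nothing matches, the shell leaves the pattern unexpanded; skip it.
        [ -e "$path" ] || continue
        basename "$path"
    done
}

# Example (directories taken from the steps above):
#   list_job_files 64 /tmp/pbs/working/spool
#   list_job_files 64 /tmp/pbs/working/mom_priv/jobs
#   list_job_files 64 /tmp/pbs/working/server_priv/jobs
```

Since the pattern requires the job number to be followed by a dot at the start of the name, files belonging to jobs such as 164 or 640 are not listed.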

Note that a PBS job that is not using any compute nodes has not necessarily completed. The PBS job is a job script, and the script may be doing other work at the moment, such as copying files. You can tell that a job has not completed if a process with the PBS job ID in its name is running on the service node the job ran on. The job has completed if you find entries in the MOM's log file indicating that the job terminated.

This is what job termination looks like in the MOM's log file:

Job;177.myri-0.n-4.r-3;task 1 terminated
Job;177.myri-0.n-4.r-3;/cplant/sbin/pingd -reset -pbs 177
Job;177.myri-0.n-4.r-3;Obit sent

The log entries show that the MOM caught the termination of PBS job number 177, ran pingd -reset to ensure that no parallel applications were left running by the job script, and sent an obituary to the PBS server.
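
The log check can also be scripted. The sketch below assumes the entries contain the fields exactly as shown above (Job;, the job ID, then the message). The function name mom_job_status is hypothetical, and the path of the MOM's log file is left as an argument because it depends on the installation.

```shell
# mom_job_status JOBNUM LOGFILE   (hypothetical helper, not a PBS command)
# Report what LOGFILE says about PBS job number JOBNUM:
#   obit-sent             the MOM logged "Obit sent" for the job
#   terminated            a task termination was logged, but no obituary
#   no-termination-entry  neither entry was found
mom_job_status() {
    jobnum=$1
    logfile=$2
    # "Job;" must be followed directly by the job number and a dot,
    # so job 77 does not match entries for job 177.
    if grep -q "Job;${jobnum}\.[^;]*;Obit sent" "$logfile"; then
        echo "obit-sent"
    elif grep -q "Job;${jobnum}\.[^;]*;task [0-9][0-9]* terminated" "$logfile"; then
        echo "terminated"
    else
        echo "no-termination-entry"
    fi
}

# Example, with the log path supplied by you:
#   mom_job_status 177 /path/to/mom/logfile
```

A result of no-termination-entry suggests the job script may still be running, in which case the cleanup steps above should not be applied.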

Lee Ann Fisk 2001-06-25