Occasionally qstat lists a job as running although it has actually completed, and attempts to delete the job from the PBS system with qdel fail. This can happen when the MOM catches the termination of the job but the network fails while the MOM is sending the obituary to the server. When you run qdel, the PBS server sends a command to the MOM to delete the job, but the MOM refuses because it believes it has already done so. As a result, the PBS server refuses to delete the job from its records.
This situation can also arise when the MOM on the node that had been hosting the job is no longer running, or when the MOM's node is unreachable. PBS can't delete the job because it can't reach the MOM.
To fix this situation, remove all files and directories related to the job that are maintained by the PBS server and the MOM daemon. It is polite to send the stdout and stderr streams to the user if you find them. Then restart the PBS server.
Suppose, for example, that PBS job number 64 has completed but qstat shows it as still running. Follow these steps to remove the job:
On the MOM's node (the service node on which the job ran):
    command>> cd /tmp/pbs/working/undelivered OR cd /tmp/pbs/working/spool
    command>> ls
    275.myri-.ER  323.myri-.OU  503.myri-.OU    64.myri-15.ER
    275.myri-.OU  547.myri-.ER  1180.myri-1.ER  64.myri-15.OU
    323.myri-.ER  547.myri-.OU  1180.myri-1.OU
    command>> mail JoeUser -s "your pbs output" < 64.myri-15.ER
    command>> mail JoeUser -s "your pbs output" < 64.myri-15.OU
    command>> rm 64.*
    command>> cd /tmp/pbs/working/mom_priv/jobs
    command>> ls
    820.myri-.JB   881.myri-.SC*  64.myri-.TK/
    820.myri-.SC*  881.myri-.TK/  820.myri-.TK/
    881.myri-.JB   64.myri-.SC*   64.myri-.JB
    command>> rm -rf 64.*
On the server's node:
    command>> cd /tmp/pbs/working/server_priv/jobs
    command>> ls -l
    total 92
    -rw------- 1 root root 4577 Jun  1 09:18 820.myri-0..JB
    -rw------- 1 root root  284 Jun  1 09:18 820.myri-0..SC
    -rw------- 1 root root 4598 Jun  1 09:22 881.myri-0..JB
    -rw------- 1 root root  182 Jun  1 09:22 881.myri-0..SC
    -rw------- 1 root root 3911 May 30 12:21 64.myri-0.n.JB
    -rw------- 1 root root  462 May 30 12:21 64.myri-0.n.SC
    command>> rm 64.*
    command>> /cplant/etc/pbs-env restart server
    -->> Executing /cplant/etc/pbs-env restart server
    restarting pbs_server, please wait
    -->> Done with /cplant/etc/pbs-env restart
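Before removing anything, it can help to see which files belong to the job. The following is a minimal sketch and not part of PBS: list_job_files is a hypothetical helper that only prints matching names and deletes nothing. It assumes the directory layout used in the example above (undelivered, spool, mom_priv/jobs and server_priv/jobs under /tmp/pbs/working); directories that do not exist on the current node are skipped.

```shell
# list_job_files WORKDIR JOBNUM
# Hypothetical helper: print every file or directory named JOBNUM.* in the
# PBS working directories. It never deletes anything.
list_job_files() {
    workdir=$1
    jobnum=$2
    for dir in undelivered spool mom_priv/jobs server_priv/jobs; do
        for f in "$workdir/$dir/$jobnum".*; do
            # An unmatched glob stays literal, so print only names that exist.
            if [ -e "$f" ]; then
                printf '%s\n' "$f"
            fi
        done
    done
}
```

For the example job this might be run as list_job_files /tmp/pbs/working 64, once on the MOM's node and once on the server's node. Because the pattern includes the dot after the job number, job 64 does not match files belonging to job 640.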
Note that a PBS job that is not using any compute nodes has not necessarily completed. The PBS job is a job script, and the script may be doing other work at the moment, such as copying files. You can tell that a job has not completed if a process whose name contains the PBS job ID is running on the service node the job ran on. The job has completed if you find entries in the MOM's log file that indicate that the job terminated.
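One rough way to look for such a process is to filter a process listing for the job number. The sketch below is an illustration only: job_processes is a hypothetical helper, and its pattern (the job number preceded by a non-digit and followed by a dot and a letter) is a heuristic that can produce false matches.

```shell
# job_processes JOBNUM
# Hypothetical helper: read a process listing on standard input and print
# the lines that appear to contain the PBS job ID JOBNUM.<server>.
# Exits with status 0 if at least one line matched.
job_processes() {
    grep -E "(^|[^0-9])$1\.[[:alpha:]]"
}
```

It could be used as ps -e -o pid,args | job_processes 64 on the service node. No output suggests that no process for the job is left, but the MOM's log file remains the more reliable indicator.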
This is what job termination looks like in the MOM's log file:
    Job;177.myri-0.n-4.r-3;task 1 terminated
    Job;177.myri-0.n-4.r-3;Terminated
    Job;177.myri-0.n-4.r-3;/cplant/sbin/pingd -reset -pbs 177
    Job;177.myri-0.n-4.r-3;kill_job
    Job;177.myri-0.n-4.r-3;Obit sent
The log entries show that the MOM caught the termination of PBS job number 177, ran pingd -reset to ensure that no parallel applications were left running by the job script, and sent an obituary to the PBS server.
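Searching the log for these entries can be scripted. The following is a minimal sketch that assumes log lines contain the fields shown above (Job;<job ID>;<message>), possibly preceded by other fields. The function names are hypothetical, and the location of the MOM's log file is not given here, so the caller must supply it.

```shell
# job_log_entries LOGFILE JOBNUM
# Hypothetical helper: print all MOM log entries for job JOBNUM.
job_log_entries() {
    grep "Job;$2\.[^;]*;" "$1"
}

# job_terminated LOGFILE JOBNUM
# Hypothetical helper: exit with status 0 if the log contains a
# "Terminated" entry for job JOBNUM, nonzero otherwise.
job_terminated() {
    grep -q "Job;$2\.[^;]*;Terminated" "$1"
}
```

With the log path stored in a variable of your choosing, for example MOM_LOG, a check for job 177 might read job_terminated "$MOM_LOG" 177 && echo "job 177 terminated". The dot after the job number keeps job 177 from matching entries for job 1770.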