
Common runtime problems and solutions

Because PBS is a network of communicating daemons that depends on the Cplant runtime system, it can get into an inconsistent state. A node may die while the MOM is communicating with the server, or an administrator may inadvertently change the resources_available.size attribute of the server while using qmgr. An NFS file system to which the MOM must copy user files may be unavailable, causing the MOM daemon to hang. Any number of such mishaps can cause a problem.

When troubleshooting a problem, look in the log files of the server, the MOMs, and the scheduler. Each daemon writes one log file per day; the files are located in /tmp/pbs/working/server_logs, /tmp/pbs/working/mom_logs, and /tmp/pbs/working/sched_logs, respectively.
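As a convenience, the three log locations for a given day can be computed with a small helper. This is only a sketch: the function name daily_log_paths is hypothetical, and it assumes that each daily log file is named by its date in YYYYMMDD form, which this section does not state. Check the actual file names in the log directories before relying on it.

```python
from datetime import date
from pathlib import PurePosixPath

# Directories named in this section.
PBS_HOME = PurePosixPath("/tmp/pbs/working")
LOG_DIRS = {"server": "server_logs", "mom": "mom_logs", "sched": "sched_logs"}


def daily_log_paths(day):
    """Return a dict mapping daemon name to that day's log file path.

    ASSUMPTION: daily log files are named YYYYMMDD (not confirmed here).
    """
    stamp = day.strftime("%Y%m%d")
    return {name: str(PBS_HOME / sub / stamp) for name, sub in LOG_DIRS.items()}


if __name__ == "__main__":
    for daemon, path in daily_log_paths(date.today()).items():
        print(daemon, path)
```

Note that the MOM log path refers to the file on whichever node that MOM runs on, so it must be read on that node.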

If you are tracing a problem with a certain job, look first in the server's accounting file to determine which service node it ran on. The accounting files (one per day) may be found in /tmp/pbs/working/server_priv/accounting. Look for the exec_host field in the "S" or "E" record for the job. (These are the records written by the server when the job is scheduled to run and when it exits.) Then go to that node and look for entries for the job in the MOM's log file.
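The accounting lookup described above can be sketched as a short script. It assumes that each accounting record is a single line of the form "timestamp;record type;job id;message", where the message is a space-separated list of key=value fields. That layout is an assumption about the accounting file format, not something stated in this section, and the function name find_exec_hosts is hypothetical.

```python
def find_exec_hosts(lines, job_id):
    """Return (record_type, exec_host) pairs from a job's "S" and "E" records.

    ASSUMPTION: each record looks like
        timestamp;record_type;job_id;key=value key=value ...
    """
    hits = []
    for line in lines:
        parts = line.rstrip("\n").split(";", 3)
        if len(parts) < 4:
            continue  # not a well-formed record, skip it
        _stamp, rec_type, rec_job, message = parts
        if rec_type not in ("S", "E") or rec_job != job_id:
            continue
        for field in message.split():
            key, sep, value = field.partition("=")
            if sep and key == "exec_host":
                hits.append((rec_type, value))
    return hits


if __name__ == "__main__":
    import sys

    # Usage: python find_exec_hosts.py <accounting file> <job id>
    with open(sys.argv[1]) as acct:
        for rec_type, host in find_exec_hosts(acct, sys.argv[2]):
            print(rec_type, host)
```

Run it against the accounting file for the day the job ran, giving the full job identifier exactly as it appears in the records. The reported exec_host names the node whose MOM log to examine next.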

This section describes some of the common problems and how to solve them.



Lee Ann Fisk 2001-06-25