Occasionally a user will report that their job exited before completing
and they don't know why. There are several possible reasons for this:
- Their job ran out of time. Look in the server's accounting file to see
which service node their job ran on. Then look in the MOM's log file on
that node. If the MOM killed the job because it ran over its walltime
allocation, you will see a message like
walltime used (24:30:52) exceeds limit (24:00:00).
There is also a message in their .e file telling them that PBS killed
their job.
- One of their processes exited with an error or a signal. They should be
able to determine this by looking at their .o and .e files (the job's
standard output and standard error), since
yod will report such an error. Alternatively, look in the server's
accounting file to determine what node their job ran on, then look in
the userlog file written by yod for an entry about their job. The
entry will indicate if one of their processes terminated with an
error code or a signal.
- The MOM refused to run the job that was handed to it by the server. Again,
look in the server's accounting file to determine which node that job
ran on. Then look in the MOM's log file on that node. If the MOM refused
to run the job, it will log that fact to the file, along with a reason.
This happens rarely; when it does, the best course of action is to restart
the server and all the MOMs.
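
The walltime check described in the first item can be partly automated. The following is a minimal Python sketch, not part of PBS, that scans lines from a MOM log for the message quoted above and reports how far each job ran past its limit. The regular expression is built only from that one example message; the text surrounding it in a real MOM log line, and the helper names used here, are assumptions.

```python
import re

# Built from the single example message quoted in this section.
# The rest of the MOM log line format is an assumption, so we only
# search for this fragment anywhere in the line.
WALLTIME_RE = re.compile(
    r"walltime used \((\d+):(\d{2}):(\d{2})\) "
    r"exceeds limit \((\d+):(\d{2}):(\d{2})\)"
)


def to_seconds(hours, minutes, seconds):
    """Convert an HH:MM:SS triple of strings to a number of seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds)


def find_walltime_overruns(lines):
    """Yield (used_seconds, limit_seconds, line) for each matching line.

    `lines` is any iterable of strings, for example an open log file.
    """
    for line in lines:
        match = WALLTIME_RE.search(line)
        if match:
            groups = match.groups()
            used = to_seconds(*groups[:3])
            limit = to_seconds(*groups[3:])
            yield used, limit, line.rstrip("\n")
```

To use it, open the MOM's log file on the node named in the accounting file and pass the file object to find_walltime_overruns; the difference between the first two values of each result is the overrun in seconds. The location of the log file depends on how PBS was installed, so no path is given here.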
Lee Ann Fisk
2001-06-25