SIGUSR1
If the PCT receives a SIGUSR1, it will log certain information
(to /var/log/cplant) that may be useful in debugging problems.
Here is an example:
PCT received SIGUSR1 (30)
was in listen_app_process (process start request)
then in wait_for_app (top)
now in bebopd_request_check ()
PCT status 3
loop counter 805892
attn 0 started 1 done 0
pid 549 ppid 114 child_rc 0 child_term_sig 0
start_time 991104989 end_time 0 bt_size 0
status 0x0000
The PCT lists the last 3 routines it was in before receiving
the signal. In this example the PCT was checking for application
process messages, then checking for application process
termination, then checking for bebopd messages.
It then lists a status code. The code values are:
- 1
- Free - the PCT is available to run a job.
- 2
- Pending - the PCT has been allocated to a job but hasn't yet
received the initial message from a yod process. If it doesn't hear
from a yod process soon, the PCT will transition back to Free.
- 3
- Busy - the PCT is hosting a member of a parallel application.
- 4
- Down - the PCT has encountered a serious problem and is about to exit.
- 6
- Trouble - the PCT has encountered a problem that can be fixed
by system administrators. It will refuse to load an application
process when it is in this state. See /var/log/cplant for a
description of the problem. (See the description of the
PCT_HEALTH_CHECK variable to modify the things checked
by the PCT when determining whether or not there is trouble on the node.)
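As a minimal sketch, the status codes above can be kept in a small lookup table for use when reading a log. The helper name describe_pct_status is hypothetical and not part of Cplant; the line format is taken from the example dump above.

```python
# Status codes and names as listed in this section.
PCT_STATUS = {
    1: "Free",
    2: "Pending",
    3: "Busy",
    4: "Down",
    6: "Trouble",
}

def describe_pct_status(line):
    """Turn a 'PCT status N' log line into a readable state name.

    Hypothetical helper for illustration; not a Cplant utility.
    """
    parts = line.split()
    if len(parts) != 3 or parts[:2] != ["PCT", "status"]:
        raise ValueError(f"not a PCT status line: {line!r}")
    code = int(parts[2])
    # Codes not documented above are reported as unknown rather than guessed.
    return PCT_STATUS.get(code, f"unknown ({code})")

print(describe_pct_status("PCT status 3"))
```

For the example dump, this reports Busy, which matches the application process details that follow in the log.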
The loop counter is incremented each time the PCT traverses its
main loop. If subsequent displays don't show the counter incrementing,
the PCT is stuck somewhere.
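The check described above can be sketched as a comparison of two dumps taken some time apart. This assumes each dump is available as a text string containing a "loop counter N" line as in the example; the function names are hypothetical.

```python
import re

# Matches the "loop counter N" line shown in the example dump.
_LOOP_RE = re.compile(r"^loop counter (\d+)\s*$", re.MULTILINE)

def loop_counter(dump):
    """Extract the loop counter value from one SIGUSR1 dump."""
    match = _LOOP_RE.search(dump)
    if match is None:
        raise ValueError("no 'loop counter' line found")
    return int(match.group(1))

def pct_looks_stuck(earlier_dump, later_dump):
    """Return True if the counter did not change between two dumps."""
    return loop_counter(later_dump) == loop_counter(earlier_dump)

first = "PCT status 3\nloop counter 805892\n"
second = "PCT status 3\nloop counter 805892\n"
print(pct_looks_stuck(first, second))
```

An unchanged counter only suggests a stuck PCT if the two signals were sent far enough apart for the main loop to have run at least once in between.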
If the PCT is busy, the next lines display the state of the application
process.
- attn
- "1" means something just happened with the application (such as
its termination, or a request for its nid map) that the PCT has to
handle but hasn't handled yet.
- started
- 1 if the app process has been forked, 0 if it hasn't
- done
- 1 if the app process has terminated, 0 if it hasn't
- pid
- system PID of the app process
- ppid
- portal PID of the app process
- child_rc
- if done, the exit code of the application process
- child_term_sig
- the terminating signal, if any, of the app process
- start_time, end_time
- start and end time (in seconds) of app process
- bt_size
- the size in bytes of the stack trace collected if the
app process terminated with a signal
- status
- a bitmap indicating the progress of the application process through
the load. The strings describing the bits are listed too. Once
the load has completed and the app is running, some of the
bits are cleared because it no longer matters how far the process
got through the load.
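The application process lines in the example dump are sequences of name/value pairs, so they can be read into a dictionary keyed by the field names above. This is a sketch that assumes the layout shown in the example; parse_app_state is a hypothetical helper, not a Cplant tool.

```python
def parse_app_state(lines):
    """Parse 'name value name value ...' lines into a dict of integers."""
    state = {}
    for line in lines:
        tokens = line.split()
        if len(tokens) % 2 != 0:
            raise ValueError(f"expected name/value pairs: {line!r}")
        for name, value in zip(tokens[0::2], tokens[1::2]):
            # Base 0 accepts both decimal values and the 0x-prefixed status.
            state[name] = int(value, 0)
    return state

example = [
    "attn 0 started 1 done 0",
    "pid 549 ppid 114 child_rc 0 child_term_sig 0",
    "start_time 991104989 end_time 0 bt_size 0",
    "status 0x0000",
]
state = parse_app_state(example)
print(state["pid"], state["ppid"], state["started"], state["done"])
```

With the example lines, the result shows an app process that has been forked (started is 1) and has not yet terminated (done is 0).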
The application status bits of interest are:
- 0x0002
- application process is forked
- 0x0200
- the application process has been sent to user code
- 0x0400
- the application process has terminated
- 0x0800
- the application process encountered an error in startup
- 0x1000
- The PCT sent the application process a SIGTERM
- 0x2000
- The PCT sent the application process a SIGKILL
The following status bits are set during application startup and
cleared after the application proceeds to user code. If they
are set, then the application may be stuck in its
startup procedure:
- 0x0001
- application load has begun
- 0x0004
- application process has requested its portal process ID map
- 0x0008
- application process has been sent its portal process ID map
- 0x0010
- application process has requested its node ID map
- 0x0020
- application process has been sent its node ID map
- 0x0040
- application process has requested its file IO proxy information
- 0x0080
- application process has been sent its file IO proxy information
- 0x4000
- The application process in startup has sent the PCT its portal process ID
- 0x8000
- The PCT has sent the application process a list of fyod file servers
- 0x0100
- The startup protocol is done and the application process is ready to proceed
to user code
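The two bit tables above can be applied to a status value with a bitwise AND per bit. The sketch below is illustrative only: the table and function names are hypothetical, the descriptions are abbreviated from the lists above, and the sample value 0x0013 is made up rather than taken from a real log.

```python
# Bits listed above as being of interest.
GENERAL_BITS = {
    0x0002: "application process is forked",
    0x0200: "application process has been sent to user code",
    0x0400: "application process has terminated",
    0x0800: "application process encountered an error in startup",
    0x1000: "PCT sent the application process a SIGTERM",
    0x2000: "PCT sent the application process a SIGKILL",
}

# Bits set during startup and cleared once user code is reached.
STARTUP_BITS = {
    0x0001: "application load has begun",
    0x0004: "requested portal process ID map",
    0x0008: "was sent portal process ID map",
    0x0010: "requested node ID map",
    0x0020: "was sent node ID map",
    0x0040: "requested file IO proxy information",
    0x0080: "was sent file IO proxy information",
    0x0100: "startup protocol done, ready for user code",
    0x4000: "sent the PCT its portal process ID",
    0x8000: "was sent a list of fyod file servers",
}

def decode_status(status):
    """Return (general, startup) description lists for the set bits."""
    general = [text for bit, text in sorted(GENERAL_BITS.items()) if status & bit]
    startup = [text for bit, text in sorted(STARTUP_BITS.items()) if status & bit]
    return general, startup

general, startup = decode_status(0x0013)  # made-up sample value
print("general:", general)
print("startup:", startup)
```

Any entries in the startup list that persist across several dumps point at the step where the application process may be stuck.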
Another way to view these status bits is to run pingd -v. If
the PCT is hosting an application, the job status information will be listed.
Lee Ann Fisk
2001-06-25