Next: yod Parallel Application Launcher Up: Interfaces Previous: SIGHUP   Contents   Index


SIGUSR1

If the PCT receives a SIGUSR1, it will log certain information (to /var/log/cplant) that may be useful in debugging problems. Here is an example:

PCT received SIGUSR1 (30)   
was in listen_app_process (process start request)   
then in wait_for_app (top)   
now in bebopd_request_check ()   
PCT status 3   
loop counter 805892   
attn 0 started 1 done 0    
pid 549 ppid 114 child_rc 0 child_term_sig 0   
start_time 991104989  end_time 0  bt_size 0   
status 0x0000

The PCT lists the last 3 routines it was in before receiving the signal. In this example the PCT was checking for application process messages, then checking for application process termination, then checking for bebopd messages.
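As a rough illustration, the three routine lines can be pulled out of a saved dump with a small parser. This is only a sketch: the line shapes ("was in", "then in", "now in", followed by a routine name and a parenthesized note) are inferred from the single example above, and the function name routine_trail is hypothetical.

```python
import re

# Assumed line shape, inferred from the example dump only:
#   "<was|then|now> in <routine> (<note>)"
ROUTINE_LINE = re.compile(r"^(was|then|now) in (\w+) \((.*)\)$")


def routine_trail(dump_text):
    """Return [(routine, note), ...] in the order logged (oldest first)."""
    trail = []
    for line in dump_text.splitlines():
        match = ROUTINE_LINE.match(line.strip())
        if match:
            trail.append((match.group(2), match.group(3)))
    return trail
```

The last entry of the returned list is the routine the PCT was in when the signal arrived.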

It then lists a status code. The code values are:

1
Free - the PCT is available to run a job.

2
Pending - the PCT has been allocated to a job but hasn't yet received the initial message from a yod process. If it doesn't hear from a yod process soon, the PCT will transition back to Free.

3
Busy - the PCT is hosting a member of a parallel application.

4
Down - the PCT has encountered a serious problem and is about to exit.

6
Trouble - the PCT has encountered a problem that can be fixed by system administrators. It will refuse to load an application process when it is in this state. See /var/log/cplant for a description of the problem. (See the description of the PCT_HEALTH_CHECK variable to modify the things checked by the PCT when determining whether or not there is trouble on the node.)
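The code values above can be kept in a small lookup table when scanning logs. The following is a minimal sketch, assuming the status line always looks like "PCT status N" as in the example; the names PCT_STATUS and pct_status_name are made up for illustration.

```python
# Status codes as listed above (note that 5 is not listed).
PCT_STATUS = {
    1: "Free",
    2: "Pending",
    3: "Busy",
    4: "Down",
    6: "Trouble",
}


def pct_status_name(dump_text):
    """Return the name for the first "PCT status N" line, or None if absent."""
    for line in dump_text.splitlines():
        fields = line.split()
        if len(fields) == 3 and fields[:2] == ["PCT", "status"]:
            code = int(fields[2])
            return PCT_STATUS.get(code, f"unknown ({code})")
    return None
```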

The loop counter is incremented each time the PCT traverses its main loop. If subsequent displays don't show the counter incrementing, the PCT is stuck somewhere.
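One way to apply this check is to send SIGUSR1 twice and compare the two logged counters. The sketch below assumes each dump has been saved as text and contains a "loop counter N" line as in the example; loop_counter and looks_stuck are hypothetical helper names.

```python
import re

# Assumed line shape, taken from the example dump: "loop counter <N>"
LOOP_COUNTER = re.compile(r"^loop counter (\d+)\s*$", re.MULTILINE)


def loop_counter(dump_text):
    """Return the loop counter from one dump, or None if no such line."""
    match = LOOP_COUNTER.search(dump_text)
    return int(match.group(1)) if match else None


def looks_stuck(earlier_dump, later_dump):
    """True if the counter did not change between two successive dumps."""
    before = loop_counter(earlier_dump)
    after = loop_counter(later_dump)
    if before is None or after is None:
        raise ValueError("no loop counter line found")
    return after == before
```

An unchanged counter only suggests a stuck PCT if enough time passed between the two signals for the main loop to run at least once.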

If the PCT is busy, the next lines display the state of the application process.

attn
1 if something just happened with the application (for example, it terminated or requested its nid map) that the PCT has to handle but hasn't handled yet

started
1 if the app process has been forked, 0 if it hasn't
done
1 if the app process has terminated, 0 if it hasn't
pid
system PID of the app process
ppid
portal PID of the app process
child_rc
if done, the exit code of the application process
child_term_sig
the terminating signal, if any, of the app process
start_time, end_time
start and end time (in seconds) of app process
bt_size
the size in bytes of the stack trace collected if the app process terminated with a signal
status
a bit map indicating the progress of the application process through the load. The strings describing the bits are listed too. Once the load has completed and the app is running, some of the bits are cleared because it no longer matters how far the process got through the load.
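The application process fields can be collected into a dictionary for scripting. This is a sketch under assumptions: the lines are taken to be "name value" pairs separated by whitespace, as in the example, and any descriptive strings after the status value are ignored because their exact format is not shown here. The names APP_FIELDS and parse_app_state are hypothetical.

```python
# Field names as they appear in the example dump.
APP_FIELDS = {
    "attn", "started", "done", "pid", "ppid", "child_rc",
    "child_term_sig", "start_time", "end_time", "bt_size",
}


def parse_app_state(dump_text):
    """Return {field: int} for the application process lines of one dump."""
    state = {}
    for line in dump_text.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        # "status 0x...." may be followed by descriptive strings; keep the value only.
        if tokens[0] == "status" and len(tokens) >= 2:
            state["status"] = int(tokens[1], 16)
            continue
        names, values = tokens[0::2], tokens[1::2]
        if len(names) != len(values) or not set(names) <= APP_FIELDS:
            continue  # not an application-state line (e.g. "PCT status 3")
        for name, value in zip(names, values):
            state[name] = int(value)
    return state
```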

The application status bits of interest are:

0x0002
application process is forked
0x0200
the application process has been sent to user code
0x0400
the application process has terminated
0x0800
the application process encountered an error in startup
0x1000
The PCT sent the application process a SIGTERM
0x2000
The PCT sent the application process a SIGKILL
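For illustration, a status value can be decoded against the bits listed above with a simple mask test. This is a minimal sketch; the table and function names are made up, and only the six bits listed above are covered.

```python
# Status bits of interest, as listed above.
STATUS_BITS = {
    0x0002: "application process is forked",
    0x0200: "application process has been sent to user code",
    0x0400: "application process has terminated",
    0x0800: "application process encountered an error in startup",
    0x1000: "PCT sent the application process a SIGTERM",
    0x2000: "PCT sent the application process a SIGKILL",
}


def describe_status(status):
    """Return the descriptions of every listed bit that is set in status."""
    return [text for bit, text in STATUS_BITS.items() if status & bit]
```

For example, describe_status(0x0202) reports a process that was forked and has been sent to user code.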

The following status bits are set during application startup and cleared after the application proceeds to user code. If they are set, the application may be stuck in its startup procedure:

0x0001
application load has begun
0x0004
application process has requested its portal process ID map
0x0008
application process has been sent its portal process ID map
0x0010
application process has requested its node ID map
0x0020
application process has been sent its node ID map
0x0040
application process has requested its file IO proxy information
0x0080
application process has been sent its file IO proxy information
0x4000
The application process in startup has sent the PCT its portal process ID
0x8000
The PCT has sent the application process a list of fyod file servers
0x0100
The startup protocol is done and the application process is ready to proceed to user code
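A quick way to see how far a process got through startup is to list which of these bits are still set. The sketch below keeps the bits in the order they are listed above; the names STARTUP_BITS, startup_progress and may_be_stuck_in_startup are hypothetical, and whether a set bit really indicates a hang depends on when the dump was taken.

```python
# Startup bits in the order listed above.
STARTUP_BITS = [
    (0x0001, "application load has begun"),
    (0x0004, "requested its portal process ID map"),
    (0x0008, "was sent its portal process ID map"),
    (0x0010, "requested its node ID map"),
    (0x0020, "was sent its node ID map"),
    (0x0040, "requested its file IO proxy information"),
    (0x0080, "was sent its file IO proxy information"),
    (0x4000, "sent the PCT its portal process ID"),
    (0x8000, "was sent a list of fyod file servers"),
    (0x0100, "startup protocol is done"),
]

# Union of all startup bits.
STARTUP_MASK = 0
for _bit, _text in STARTUP_BITS:
    STARTUP_MASK |= _bit


def startup_progress(status):
    """Return descriptions of the startup bits still set in status."""
    return [text for bit, text in STARTUP_BITS if status & bit]


def may_be_stuck_in_startup(status):
    """True if any startup bit is still set."""
    return bool(status & STARTUP_MASK)
```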

Another way to view these status bits is to run pingd -v. If the PCT is hosting an application, the job status information will be listed.


Lee Ann Fisk 2001-06-25