Cplant Parallel Application Support:
Job Launch, Monitoring and Termination
Batch scheduling and debugging
Administrator's Guide (1.0)
Lee Ann Fisk
May 11 2001
Contents
Introduction
Overview
Initializing and Managing a Virtual Machine
Configuration Files and Log files
About the location of configuration and log files
Site file
Cplant host file
Cplant map file
The virtual machine name file
The user log files
The cplant log file
The bebopd restart file
Starting the Daemons
Stopping and Restarting the Daemons
The PCT
The bebopd
Running a Parallel Application
bebopd Node Allocator
Command line arguments
Typical use
Interfaces
SIGUSR1
SIGUSR2
SIGHUP
User log file
Errors and Warnings
PCT Compute Node Daemon
Command line arguments
Typical use
Interfaces
Incoming messages
Status updates
SIGHUP
SIGUSR1
yod Parallel Application Launcher
Command line arguments
General options
Debugging options
Testing options
Heterogeneous load file
Environment variables
YODRETRYCOUNT
PATH
CPLANT_STDERR and CPLANT_STDOUT
PBS variables
Typical use
Load errors
Common load errors detected by yod
Common load errors reported by the PCTs to yod
Failure of a compute node
Failure of application
Retries upon load failure
User log file
pingd Status Utility
Command line arguments
Action options
Node specifier options
Display options
Testing options
Usage Examples
Debugging parallel applications
yod -bt
cgdb - a gdb front end
Running gdb on the compute node
Running the application under strace
PBS for Cplant
PBS components
Cplant runtime components
PBS runtime configuration files and logging
Log files useful for troubleshooting
Configuration files
Starting and stopping PBS
pbs_server
pbs_mom
pbs_sched
Common runtime problems and solutions
Clients can't reach server
Node counts out of balance
Count of active compute nodes
Count of assigned nodes
Can't delete a job that has completed
Nodes are free but scheduler won't run any jobs
User doesn't get their output files
User doesn't know why their job exited early
Building and installing PBS on Cplant
Maintaining the runtime directories
Setting up a virtual machine
The VM software tree
user-env
pbs-env
vm-config and types files
site file
cplant-host file
vmname file
Start it up
Solutions to common runtime problems
moving the bebopd to a different node
executable check sum error
users notice there are free nodes (PBS jobs aren't using all their nodes)
stale PCTs
load failure scenarios
yod can't write userlog file because of old LOCK files
Site definition file format
Bibliography
Index
Lee Ann Fisk 2001-06-25