Batch Scheduling on Cplant/alaska with the Portable Batch System (PBS)
The batch scheduling system used on Cplant/alaska is called PBS (Portable Batch System), an open source batch scheduling solution available from MRJ Technology Solutions.
The PBS code running on alaska has several modifications for our environment. For example, there was no resource built into PBS to represent the compute nodes allocated to a job. (If you are familiar with PBS on clusters, you may think this should be the nodes resource. But the nodes resource represents the service nodes on which job scripts execute. Every job obtains one timeshared node from PBS.) Therefore we added a size resource to represent the number of compute nodes requested for the parallel applications. (It is analogous to the size argument of yod.) When submitting PBS jobs, you should specify the size requirement of the job.
In addition to these enhancements to PBS, the Cplant runtime codes have been augmented to support PBS. For example, you can use pingd to display or kill all the parallel applications started by a particular PBS job, and the bebopd node allocator updates the PBS server if the number of compute nodes in the machine changes.
The role of PBS is to handle requests for machine resources fairly, and to ensure that jobs granted resources for a fixed period of time do not exceed their allocated usage.
At the present time, two resources may be requested from PBS:
| Resource | Description |
|---|---|
| walltime | The total time required by your PBS script or interactive PBS session. (Example: walltime=01:00:00) |
| size | The total number of compute nodes your script or interactive session will require at any point in time. (Example: size=64) |
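For illustration, a job requesting both resources on the qsub command line might look like the following (the script name myscript is a placeholder):

```
qsub -l walltime=01:00:00,size=64 myscript
```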
PBS on Cplant does not currently support heterogeneous compute node types. A future enhancement will allow users to specify compute nodes with special properties, like large memory or a particular network interface or processor type.
A PBS job is a request for resources (machine time and compute nodes). You may submit a script which will access the resources at some later time, or you may request an interactive shell which will have access to the resources. PBS assigns your request a numeric job ID when it is submitted. Your job script or interactive shell, when scheduled to execute, may run many Cplant parallel applications. Each Cplant parallel application will have a numeric Cplant job ID.
When using the PBS utility qstat, your job will be identified by its PBS job ID. PBS doesn't know about parallel applications started by yod during your job session. When running the Cplant utility pingd, your currently running parallel applications will be identified by both their Cplant job ID and PBS job ID. There may be several Cplant jobs for each PBS job. (You can specify either ID to pingd when displaying or resetting nodes.)
This section describes the current scheduling policy. We welcome suggestions for changes. Please send feedback or questions to alaska-help@sandia.gov , or direct discussion about suggested changes to alaska-users@sandia.gov .
There are two queues to which PBS jobs may be submitted:
| Queue Name | Queue Description |
|---|---|
| prime | The prime time queue. |
| nonprime | The non-prime time queue. |
Prime time is 8 AM to 6 PM (MST) on normal work days. Non-prime time is all other times.
During prime time, only jobs from the prime queue will be scheduled to run. These jobs may not request more than N/2 node hours, where N is the number of compute nodes in the machine. (A job's node hours request is the product of its size and walltime requests.) Such a job can be submitted, but the PBS scheduler will reject it when it detects that the node hour request exceeds N/2.
In addition, the total of the node hours requests for all running jobs for a single user must not exceed N/2. The scheduler will hold on to jobs that would cause this limit to be exceeded and run them after the user's earlier jobs have completed.
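For illustration (the machine size here is hypothetical): on a machine with N = 256 compute nodes, the prime time limit is 128 node hours. A job requesting size=64 and walltime=03:00:00 asks for 64 × 3 = 192 node hours and would be rejected, while the same job with walltime=02:00:00 asks for exactly 128 node hours and would be accepted.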
PBS will not schedule a prime job if its walltime request would extend past the end of prime time.
The maximum walltime request permitted is 10 hours.
The default resource request
for jobs in the prime queue is size=1,walltime=00:10:00.
During non-prime time, jobs from the nonprime queue have priority. If there are no runnable jobs in the nonprime queue, jobs in the prime queue will be considered for execution.
PBS will not schedule a job from the nonprime queue if its walltime request would extend past the end of non-prime time. It may, however, schedule a prime job to begin during non-prime time, even if that job would extend into prime time.
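As an example of this rule, a nonprime job requesting walltime=03:00:00 would not be started at 6 AM on a work day, since it would run past the 8 AM start of prime time; a job from the prime queue with the same request could be started at 6 AM and allowed to run into prime time.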
The default resource request
for jobs in this queue is size=1,walltime=00:10:00.
The PBS scheduler is a daemon process that evaluates all jobs in the queues and selects the next job to run. The scheduler running on alaska was built from the fairshare FIFO scheduler that comes packaged with PBS, enhanced with features required by Cplant.
All Cplant users are listed in a PBS configuration file where they are assigned an equal share of the machine. The PBS scheduler keeps track of the number of node hours each user has accumulated on the machine, and decays the node hour usage over time. During each scheduling cycle, PBS divides the user's share of the machine by the user's recent usage. It chooses the runnable job with the maximum value of this quotient. If several jobs have the same value, they are run in FIFO order.
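As a hypothetical illustration: with 20 users each assigned an equal 5% share, a user whose decayed recent usage is 100 node hours has a quotient of 0.05 / 100 = 0.0005, while a user with 10 node hours of decayed usage has 0.05 / 10 = 0.005; a runnable job from the second user would be selected first.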
If a job has not been scheduled for 24 hours it is declared to be starving and all other jobs are blocked until there are enough nodes for the starving job to run.
PBS monitors running jobs and can kill them if they exceed their allocated wallclock time. For Cplant, we have modified PBS so that it will kill a job only if there are other jobs waiting to run. PBS kills a job by sending a SIGKILL to all of its parallel applications. If you are running interactively, your command shell will terminate within a few minutes of this. If you submitted a job script, it will be sent a SIGTERM, and later a SIGKILL if it hasn't yet terminated.
The Cplant node allocator refuses to allocate nodes to a yod request from a PBS job if the request would give the job more compute nodes than it was allocated by PBS. So while it is possible for a PBS job to exceed its wallclock resource request, it is not possible for it to exceed its compute node (size) request.
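For example, within a PBS job that requested size=64, a command such as yod -sz 128 mycode (mycode is a placeholder) would be refused by the node allocator, while yod -sz 64 mycode would be granted nodes, assuming no other parallel applications from the same job are currently holding nodes.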
There are two ways to obtain time on compute nodes. One way is to submit a job script to PBS, the other way is to request interactive time from PBS. Both types of requests are made with the PBS qsub command.
Your qsub request should specify the following options, which can be listed on the qsub command line or appear as PBS directives in the job script.
| Request | Description | Default |
|---|---|---|
| walltime request | The total time required by your PBS script or interactive PBS session. | 10 minutes |
| size request | The total number of compute nodes your script or interactive session will require at any point in time. | 1 node |
| queue name | The queue to which your job should be appended. | prime |
The following job script runs a single 64 node application. It first changes to the directory from which the job was submitted, as otherwise PBS will run the script in the user's home directory. (The PBS environment variables available to your PBS job are listed in the section titled PBS environment variables.)
```
#!/bin/bash
#
echo "******STARTING****************************"
#
cd $PBS_O_WORKDIR
echo "my pbs job is $PBS_JOBID"
#
/cplant/bin/yod -sz 64 /home/pbsuser/tests/sometest
#
echo "******DONE********************************"
```
To submit the job script, which is named script1, to the prime queue, requesting 10 minutes and 64 compute nodes, enter the qsub command:
```
/bin/qsub -l walltime=0:10:0,size=64 -q prime script1
50.service-4.sandia.gov
```
When you enter the qsub command, PBS immediately displays a PBS job ID and places your job in the queue. (The numeric job ID is followed by the name of the machine on which the PBS server is running.) You can now run the PBS qstat -s command to see the status of your job.
Instead of placing options on the qsub command line, you could place directives in the job script:
```
#!/bin/bash
#
#PBS -l walltime=0:10:0
#PBS -q prime
#PBS -l size=64
#
echo "******STARTING****************************"
#
cd $PBS_O_WORKDIR
echo "my pbs job is $PBS_JOBID"
#
/cplant/bin/yod -sz 64 /home/pbsuser/tests/sometest
#
echo "******DONE********************************"
```
And submit the script with the command qsub script1.
At some point in time, PBS will forward your job script to a service node for execution. (Note - As of 1/11/00 PBS will run your job on the service node on which you ran qsub. This is due to the fact that users are writing data to the service node local disk. When Cplant has parallel IO, PBS will revert to load balancing PBS jobs across service nodes.) Upon job completion, output files containing your script's standard output and standard error streams will be written to the directory from which you submitted the job. (See the qsub man page for numerous options including an option that tells PBS to write the output files to an alternate location.)
Note that your PBS job is your job script, not your parallel application. Until your job script completes, the compute nodes you have requested are unavailable to other users, even though you may have no parallel applications running. If you have a lengthy amount of file copying or other cleanup work to do when your application completes, consider submitting two job scripts. The first job script runs your parallel application(s). The second script requests no compute nodes and does your post-processing. Inform PBS that the second script is dependent on the completion of the first script. (See Submitting jobs with dependencies, below.)
The other way to obtain machine time is to request interactive time
from PBS. This is done with the -I option of
qsub. The following command requests
10 minutes of interactive time during which up to 64 compute nodes
may be used. The request is submitted to the prime queue.
PBS displays a job ID, and eventually a ready message at
which time a shell prompt appears.
```
qsub -I -l walltime=0:10:0,size=64 -q prime
qsub: waiting for job 51.service-4.sandia.gov to start
qsub: job 51.service-4.sandia.gov ready
$ yod -sz 64 mycode
```
It is possible to submit two job scripts to PBS, and to specify that the second script should run only after the first script has completed.
In the next example, the script named script2 will run only after script1 (which is PBS job number 317) has completed. (Note that script1 has to be "in the system" when the second qsub command is entered. If script1 runs and terminates before the second qsub command is entered, script2 will never execute.)
```
qsub script1
317.service-6.sandia.gov
qsub -W depend=afterany:317 script2
318.service-6.sandia.gov
```
In this next example, script2 will be scheduled to run only if script1 terminates without an error code.
```
qsub script1
322.service-6.sandia.gov
qsub -W depend=afterok:322 script2
323.service-6.sandia.gov
```
In the final example, script2 will be scheduled to run only if script1 terminates with an error code.
```
qsub script1
352.service-6.sandia.gov
qsub -W depend=afternotok:352 script2
353.service-6.sandia.gov
```
See the qsub man page for ways to specify other dependency relationships among sets of jobs.
The two most commonly used commands will probably be qsub (to submit a batch job) and qstat (to view job status). The list below shows all PBS commands that may be helpful to users.
| Command | Description |
|---|---|
| qalter | alter a PBS batch job |
| qdel | delete a PBS batch job |
| qhold | place a hold on a PBS batch job |
| qmove | move a job to a different queue |
| qorder | exchange the FIFO ordering of two jobs in a queue |
| qrerun | terminate a job and return it to the queue |
| qrls | release a hold on a job |
| qselect | list all jobs meeting certain criteria |
| qsig | send a signal to a PBS job |
| qstat | list all PBS jobs in the system |
| qsub | submit a new job to PBS |
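For example, to list your jobs and then delete one of them (the job ID shown below is hypothetical), you might enter:

```
qstat -s
qdel 317.service-6.sandia.gov
```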
This simple script changes to the directory from which the job was submitted and runs a 32 node application.
```
#
cd $PBS_O_WORKDIR
yod -sz 32 tests/atest
echo all done
#
```
Then to request 32 nodes for 15 minutes during prime time, assuming the
script name is myscript:
```
qsub -l size=32,walltime=0:15:0 -N myjobname -q prime myscript
3121.myri-15.SU-4.SM-0.alaska
```
The following job script contains PBS directives. They name the job (test64nodes), request 10 minutes of machine time and 64 compute nodes, specify that mail should be sent to pbsuser@sandia.gov, and that the job should be submitted to the queue named prime. (For all directives, see the qsub man page.) It changes to the directory from which the job was submitted, runs two 32 node jobs in the background, and waits for them to complete.
Note that the script must wait for the yod jobs to complete, because PBS will kill all outstanding parallel applications for this PBS job upon termination of the script.
```
#!/bin/bash
#
#PBS -N test64nodes
#PBS -l walltime=0:10:0
#PBS -q prime
#PBS -M pbsuser@sandia.gov
#PBS -l size=64
#
# Change to the directory from which I submitted the
# job, otherwise PBS runs the script in my home directory.
#
cd $PBS_O_WORKDIR
echo "**********************************"
date
echo "my pbs job is $PBS_JOBID"
/cplant/bin/yod -sz 32 /home/pbsuser/tests/sometest &> yod1.out &
/cplant/bin/yod -sz 32 /home/pbsuser/tests/othertest &> yod2.out &
#
# Wait for all background jobs to complete.
#
wait
date
echo "**********************************"
```
Since all the options are specified in the script, it is submitted by simply entering
```
qsub myscript
3125.myri-15.SU-4.SM-0.alaska
```
PBS does not automatically pass all environment variables defined
when you run qsub to the shell in which your job eventually executes.
The table below lists the environment variables always set up by PBS
and available to your job. You can specify a list of other
variables to be passed to the job with the -v option to
qsub. You can request that all
environment variables defined in your environment be passed to
the job with the -V argument to qsub.
| Variable name | Description |
|---|---|
| PBS_O_HOME | the HOME environment variable of the submitter |
| PBS_O_SHELL | the SHELL environment variable of the submitter |
| PBS_NNODES | the submitter's "size" resource request |
| PBS_O_QUEUE | the name of the queue to which you submitted your request |
| PBS_O_HOST | the name of the host from which you submitted your request |
| PBS_ENVIRONMENT | equal to PBS_INTERACTIVE or PBS_BATCH |
| PBS_O_LOGNAME | the LOGNAME environment variable of the submitter |
| PBS_O_PATH | the PATH environment variable of the submitter |
| PBS_O_WORKDIR | the path name of the directory from which you submitted your request |
| PBS_JOBNAME | the value of the -N argument to qsub |
| PBS_O_MAIL | the MAIL environment variable of the submitter |
| PBS_JOBID | the PBS job ID assigned to the job |
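As noted above, additional variables can be passed explicitly with -v, or your entire environment with -V. For example (the variable names and values here are hypothetical):

```
qsub -v CASE=test64,NITER=100 myscript
qsub -V myscript
```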
Here are the steps to obtain interactive time from PBS. Suppose you want 64 nodes for one hour during prime time.
1. If you want PBS to send mail about your job, specify the address with the -M option to qsub.
2. Enter the qsub command:

    ```
    qsub -I -l size=64,walltime=1:0:0 -N myjob -q prime
    56.myri-15.SU-4.SM-0.alaska
    ```

    (PBS replies that it has accepted your job and assigned it job ID number 56.)
3. When the job is scheduled, PBS prints a ready message followed by a shell prompt. Now you can work interactively for one hour and use up to 64 nodes at a time.

    ```
    qsub: job 56.myri-15.SU-4.SM-0.alaska ready
    $ yod -sz 64 mycode
    ```
And here are the steps to submit a job script to PBS. The script requires
64 nodes for one hour. It is submitted to the prime queue.
For a non-prime time request, submit to the nonprime queue.
1. If you want PBS to send mail about your job, specify the address with the -M option to qsub.
2. Write the job script. Lines beginning with #PBS are PBS directives. The available directives are described in the qsub man page. Alternatively, these options may be specified on the qsub command line.

    ```
    #!/bin/bash
    #
    #PBS -l size=64
    #PBS -l walltime=1:0:0
    #PBS -q prime
    #PBS -N myjobname
    #
    echo "******STARTING****************************"
    #
    # cd to the directory from which I submitted the
    # job. Otherwise it will execute in my home directory.
    #
    cd $PBS_O_WORKDIR
    echo "my pbs job is $PBS_JOBID"
    #
    /cplant/bin/yod -sz 64 /home/pbsuser/tests/sometest
    #
    echo "******DONE********************************"
    ```
3. Submit the script, here named testscript:

    ```
    qsub testscript
    251.myri-15.SU-4.SM-0.alaska
    ```
Send comments about this document to lafisk@sandia.gov