Using PBS on Cplant

Batch Scheduling on Cplant with the Portable Batch System
User Guide
(September 2000)


PBS administrators: Download the Cplant Fault Recovery Patch for PBS here. (It works at non-Cplant sites as well.) Or see the README file describing the patches.

Cplant/PBS administrators: PBS for Cplant Administrator's chapter (PostScript version of the entire admin document), PBS source code, PBS for Cplant build package (including the entire Cplant patch and build/install/configure instructions)


  1. PBS on Cplant
  2. PBS/Cplant scheduling concepts
    1. Resources
    2. The PBS job ID and the Cplant job ID
  3. Alaska scheduling policy
    1. The queues
    2. Prioritizing of jobs in queues
    3. The fairshare algorithm
    4. Deadbeat jobs
  4. Submitting jobs to PBS
    1. PBS job script
    2. PBS interactive shell
  5. Submitting jobs with dependencies to PBS
  6. Useful PBS commands
  7. Sample job scripts
  8. PBS environment variables
  9. Quick Start

Previous version of batch scheduling document (1/00)
  1. PBS on Cplant

    The batch scheduling system used on Cplant is called PBS (Portable Batch System), an open source batch scheduling solution available from Veridian Systems.

    The PBS code running on Cplant has several modifications for our environment. For example, PBS had no built-in resource to represent the compute nodes allocated to a job. (If you are familiar with PBS on clusters, you may think this should be the nodes resource. But the nodes resource represents the service nodes on which job scripts execute; every job obtains one timeshared service node from PBS.) Therefore we added a size resource to represent the number of compute nodes requested for the parallel applications. (It is analogous to the size argument of yod.) When submitting PBS jobs, you should specify the size requirement of the job.

    In addition to the enhancements to PBS, the Cplant runtime codes have been augmented to support PBS. So, for example, you can use pingd to display or kill all the parallel applications started by a particular PBS job. And the bebopd node allocator updates the PBS server if the number of compute nodes in the machine changes.

  2. PBS/Cplant scheduling concepts
    1. Resources

      The role of PBS is to handle requests for machine resources fairly, and to ensure that jobs granted the use of resources for a fixed period of time do not exceed their allocated usage.

      At the present time, two resources may be requested from PBS:

      walltime The total time required by your PBS script or interactive PBS session. (Example: walltime=01:00:00 )
      size The total number of compute nodes your script or interactive session will require at any point in time. (Example: size=64 )

      PBS on Cplant does not currently support heterogeneous compute node types. A future enhancement will allow users to specify compute nodes with special properties, like large memory or a particular network interface or processor type.

    2. The PBS job ID and the Cplant job ID

      A PBS job is a request for resources (machine time and compute nodes). You may submit a script which will access the resources at some later time, or you may request an interactive shell which will have access to the resources. PBS assigns your request a numeric job ID when it is submitted. Your job script or interactive shell, when scheduled to execute, may run many Cplant parallel applications. Each Cplant parallel application will have a numeric Cplant job ID.

      When using the PBS utility qstat, your job will be identified by its PBS job ID. PBS doesn't know about parallel applications started by yod during your job session. When running the Cplant utility pingd, your currently running parallel applications will be identified by both their Cplant job ID and PBS job ID. There may be several Cplant jobs for each PBS job. (You can specify either ID to pingd when displaying or resetting nodes.)
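
      For example, this sketch shows the two views of a running job (a hedged illustration; neither command's output is shown here):

      qstat        # lists your job under its PBS job ID
      pingd        # lists your running parallel applications, showing both
                   # the Cplant job ID and the owning PBS job ID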

  3. Alaska scheduling policy

    This section describes the current scheduling policy. We welcome suggestions for changes. Please send feedback or questions to alaska-help@sandia.gov, or direct discussion about suggested changes to alaska-users@sandia.gov.

    This policy went into effect on March 13, 2000. For information about the policy that preceded it, check here.

    1. The queues

      There are two queues to which PBS jobs may be submitted:

      Queue Name Queue Description
      default Regular jobs.
      dedicated Jobs requiring a large percentage of the machine.

      Jobs may be submitted to the default queue if their size request does not exceed a limit set by the system administrator. Most likely this limit will range between one half and three quarters of all available compute nodes. To determine the current setting, type this command on a Cplant service node:

      qmgr -c "list queue default resources_max.size"
      

      Jobs that require more compute nodes should be submitted to the dedicated queue. Dedicated time will be scheduled on Monday afternoon, following system time, if there are jobs in the dedicated queue. Dedicated time will end by Tuesday morning, when jobs from the default queue will be considered for execution again. Dedicated-time jobs that are still running on Tuesday morning will be permitted to run to completion. The dedicated queue will have an upper limit on walltime requests, and a lower limit on size requests. To check these limits, type these commands on a Cplant service node:

      qmgr -c "list queue dedicated resources_max.walltime"
      qmgr -c "list queue dedicated resources_min.size"
      

      or to list all attributes of the queue, type this command:

      qmgr -c "list queue dedicated"
      

      To summarize, large jobs should be submitted to the dedicated queue. Jobs from this queue will be scheduled for execution on Monday evenings. No jobs from the default queue will be considered for execution during this time, even if there are free nodes. The rest of the week, only jobs from the default queue will be scheduled for execution.

    2. Prioritizing of jobs in queues

      The PBS scheduler will prioritize jobs differently during prime and non-prime time. Prime time is 8 AM to 6 PM (MST) on normal work days; non-prime time is all other times. During both periods, jobs are sorted first by the owner's recent machine usage, with lower usage ranked higher. (See the description of the fairshare algorithm below.) During prime time, jobs are then sorted by increasing size request, then by increasing walltime request. During non-prime time, they are sorted by decreasing size request, then by decreasing walltime request. In effect, the scheduler puts smaller jobs into execution first during the day, and larger jobs into execution first during nights and weekends.

    3. The fairshare algorithm

      The PBS scheduler is a daemon process that evaluates all jobs in the queues and selects the next job to run. The scheduler running on Cplant was built from the fairshare FIFO scheduler that comes packaged with PBS, enhanced with features required by Cplant.

      All Cplant users are listed in a PBS configuration file where they are assigned an equal share of the machine. The PBS scheduler keeps track of the number of node hours each user has accumulated on the machine, and decays that usage over time. During each scheduling cycle, PBS divides the user's share of the machine by the user's recent usage. After sorting the jobs in the queue by size request, and within size by walltime request, the scheduler chooses the first runnable job in the list whose owner has the maximum value of this quotient. For example, if every user's share is 1, and user A has 50 node hours of decayed recent usage while user B has 200, then A's quotient (1/50) exceeds B's (1/200), and A's first runnable job is chosen ahead of B's.

      If a job has not been scheduled for 24 hours it is declared to be starving and all other jobs are blocked until there are enough nodes for the most starving job to run. To determine if a queued job is being blocked so a starving job may run, enter the qstat -s command on a service node. A notation under the listed job will indicate why the job has not yet been scheduled to run.

    4. Deadbeat jobs

      PBS monitors running jobs and can kill them if they exceed their allocated wallclock time. For Cplant, we have modified PBS so that it will kill a job only if there are other jobs waiting to run. PBS kills a job first by sending a SIGTERM to all parallel applications. (Actually, PBS invokes pingd -interrupt on all of your PBS job's Cplant applications.) If your PBS job script doesn't terminate within a few minutes, PBS will kill your script with a SIGKILL. If you are running interactively, your command shell will terminate.

      The Cplant node allocator refuses to allocate nodes to a yod request from a PBS job if the request would give the job more compute nodes than it was allocated by PBS. So while it is possible for a PBS job to exceed its wallclock resource request, it is not possible for it to exceed its compute node (size) request.
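
      For example (a sketch; the application name is hypothetical), inside a PBS job that requested size=32:

      yod -sz 64 myapp     # refused: 64 exceeds the job's allocation of 32
      yod -sz 32 myapp     # allowed: within the allocation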

  4. Submitting jobs to PBS

    There are two ways to obtain time on compute nodes. One way is to submit a job script to PBS, the other way is to request interactive time from PBS. Both types of requests are made with the PBS qsub command.

    Your qsub request should specify the following options, which can be listed on the qsub command line or appear as PBS directives in the job script.

    Request Description Default
    walltime request The total time required by your PBS script or interactive PBS session. 10 minutes
    size request The total number of compute nodes your script or interactive session will require at any point in time. 1 node
    queue name The queue to which your job should be appended. default
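
    For example, a qsub request that specifies none of these options accepts all three defaults:

    qsub script1     # requests 10 minutes, 1 compute node, the default queue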

    1. PBS job script

      The following job script runs a single 64 node application. It first changes to the directory from which the job was submitted, as otherwise PBS will run the script in the user's home directory. (The PBS environment variables available to your PBS job are listed in the section titled PBS environment variables.)

      #!/bin/bash 
      #
      echo "******STARTING****************************"
      #
      cd $PBS_O_WORKDIR
      echo "my pbs job is $PBS_JOBID"
      #
      /cplant/bin/yod -sz 64 /home/pbsuser/tests/sometest
      #
      echo "******DONE********************************"
      

      To submit the job script, which is named script1, to the default queue, requesting 10 minutes and 64 compute nodes, enter the qsub command:

      /bin/qsub -l walltime=0:10:0,size=64 -q default script1 
      50.service-4.sandia.gov
      
      

      When you enter the qsub command, PBS immediately displays a PBS job ID and places your job in the queue. (The numeric job ID is followed by the name of the machine on which the PBS server is running.) You can now run the PBS qstat command to see the status of your job. (Try the -a and -s options to qstat for more detail.)

      Instead of placing options on the qsub command line, you could place directives in the job script:

      #!/bin/bash 
      #
      #PBS -l walltime=0:10:0
      #PBS -q default
      #PBS -l size=64
      #
      echo "******STARTING****************************"
      #
      cd $PBS_O_WORKDIR
      echo "my pbs job is $PBS_JOBID"
      #
      /cplant/bin/yod -sz 64 /home/pbsuser/tests/sometest
      #
      echo "******DONE********************************"
      

      And submit the script with the command qsub script1.

      At some point in time, PBS will forward your job script to a service node for execution. Upon job completion, output files containing your script's standard output and standard error streams will be written to the directory from which you submitted the job. (See the qsub man page for numerous options including an option that tells PBS to write the output files to an alternate location.)

      Note that your PBS job is your job script, not your parallel application. Until your job script completes, the compute nodes you have requested are unavailable to other users, even though you may have no parallel applications running. If you have a lengthy amount of file copying or other cleanup work to do when your application completes, consider submitting two job scripts. The first job script runs your parallel application(s). The second script requests no compute nodes and does your post-processing. Inform PBS that the second script is dependent on the completion of the first script. (See Submitting jobs with dependencies, below.)
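
      For example, a minimal post-processing script might look like the following sketch. (The size=0 request, expressing "no compute nodes", and the file names are assumptions for illustration.)

      #!/bin/bash
      #
      #PBS -N cleanup
      #PBS -l walltime=0:30:0
      #PBS -l size=0
      #PBS -q default
      #
      # Run in the submission directory, where the first job
      # left its output (hypothetical file names).
      cd $PBS_O_WORKDIR
      cp results.out /scratch/archive/

      Submit the first script, note its PBS job ID, and submit this one with -W depend=afterany:<jobID> as shown in the next section.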

    2. PBS interactive shell

      The other way to obtain machine time is to request interactive time from PBS. This is done with the -I option of qsub. The following command requests 10 minutes of interactive time during which up to 64 compute nodes may be used. The request is submitted to the default queue. PBS displays a job ID, and eventually a ready message at which time a shell prompt appears.

      qsub -I -l walltime=0:10:0,size=64 -q default
      qsub: waiting for job 51.service-4.sandia.gov to start
      qsub: job 51.service-4.sandia.gov ready
      
      $ yod -sz 64 mycode
      
  5. Submitting jobs with dependencies to PBS

    It is possible to submit two job scripts to PBS, and to specify that the second script should run only after the first script has completed.

    In the next example, the script named script2 will run only after script1 (which is PBS job number 317) has completed. (Note that script1 has to be "in the system" when the second qsub command is entered. If script1 runs and terminates before the second qsub command is entered, script2 will never execute.)

    qsub script1
    317.service-6.sandia.gov
    qsub script2 -W depend=afterany:317
    318.service-6.sandia.gov
    

    In this next example, script2 will be scheduled to run only if script1 terminates without an error code.

    qsub script1
    322.service-6.sandia.gov
    qsub script2 -W depend=afterok:322
    323.service-6.sandia.gov
    

    In the final example, script2 will be scheduled to run only if script1 terminates with an error code.

    qsub script1
    352.service-6.sandia.gov
    qsub script2 -W depend=afternotok:352
    353.service-6.sandia.gov
    

    See the qsub man page for ways to specify other dependency relationships among sets of jobs.

  6. Useful PBS commands

    The two most commonly used commands will probably be qsub (to submit a batch job) and qstat (to view job status). The list below shows all PBS commands that may be helpful to users.

    Command Description
    qalter alter a PBS batch job
    qdel delete a PBS batch job
    qhold place a hold on a PBS batch job
    qmove move a job to a different queue
    qorder exchange the FIFO ordering of two jobs in a queue
    qrerun terminate a job and return it to the queue
    qrls release a hold on a job
    qselect list all jobs meeting certain criteria
    qsig send a signal to a PBS job
    qstat list all PBS jobs in the system
    qsub submit a new job to PBS

    Note that the qsig command sends a signal to your job script, not to the parallel applications started by your job script. However, we've modified the PBS MOM to forward SIGTERM and SIGKILL signals to your parallel applications; all other signals will be sent to your job script. If you want to send a SIGUSR1 or SIGUSR2 to your application, you need to locate the yod process that started it and send the signal to yod, which forwards these two signals to the parallel application.
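
    For example (a sketch; the process ID shown is hypothetical), on the service node where your job script is running:

    ps -u $USER | grep yod     # find the yod process for your application
    kill -USR1 12345           # send SIGUSR1 to that yod process (PID taken
                               # from the listing); yod forwards it to the
                               # parallel application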


  7. Sample job scripts

    This simple script changes to the directory from which the job was submitted and runs a 32 node application.

    #!/bin/bash
    #
    cd $PBS_O_WORKDIR
    yod -sz 32 tests/atest
    echo all done
    #
    

    Then to request 32 nodes for 15 minutes during default time, assuming the script name is myscript:

    qsub -l size=32,walltime=0:15:0 -N myjobname -q default myscript
    3121.myri-15.SU-4.SM-0.alaska
    

    The following job script contains PBS directives. They name the job (test64nodes), request 10 minutes of machine time and 64 compute nodes, specify that mail should be sent to pbsuser@sandia.gov, and that the job should be submitted to the queue named default. (For all directives, see the qsub man page.) It changes to the directory from which the job was submitted, runs two 32 node jobs in the background, and waits for them to complete.

    Note that the script must wait for the yod jobs to complete, because PBS kills all outstanding parallel applications belonging to this PBS job when the script terminates.

    #!/bin/bash
    #
    #PBS -N test64nodes
    #PBS -l walltime=0:10:0
    #PBS -q default 
    #PBS -M pbsuser@sandia.gov
    #PBS -l size=64
    #
     
    # Change to the directory from which I submitted the
    # job, otherwise PBS runs the script in my home directory.
    #
    cd $PBS_O_WORKDIR
    
    echo "**********************************"
    date
    echo "my pbs job is $PBS_JOBID"
      
    /cplant/bin/yod -sz 32 /home/pbsuser/tests/sometest &> yod1.out &
    /cplant/bin/yod -sz 32 /home/pbsuser/tests/othertest &> yod2.out &
    
    #
    # Wait for all background jobs to complete.
    #
    wait
       
    date
    echo "**********************************"
    

    Since all the options are specified in the script, it is submitted by simply entering

    qsub myscript
    3125.myri-15.SU-4.SM-0.alaska
    
  8. PBS environment variables

    PBS does not automatically pass all environment variables defined when you run qsub to the shell in which your job eventually executes. The table below lists the environment variables always set up by PBS and available to your job. You can specify a list of other variables to be passed to the job with the -v option to qsub. You can request that all environment variables defined in your environment be passed to the job with the -V argument to qsub.
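
    For example (a sketch; DATADIR and myscript are hypothetical names):

    qsub -v DATADIR=/scratch/run1 myscript     # pass only DATADIR to the job
    qsub -V myscript                           # pass your entire environment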

    Variable name Description
    PBS_O_HOME the HOME environment variable of the submitter
    PBS_O_SHELL the SHELL environment variable of the submitter
    PBS_NNODES the submitter's "size" resource request
    PBS_O_QUEUE the name of the queue to which you submitted your request
    PBS_O_HOST the name of the host from which you submitted your request
    PBS_ENVIRONMENT equal to PBS_INTERACTIVE or PBS_BATCH
    PBS_O_LOGNAME the LOGNAME environment variable of the submitter
    PBS_O_PATH the PATH environment variable of the submitter
    PBS_O_WORKDIR the path name of the directory from which you submitted your request
    PBS_JOBNAME the value of the -N argument to qsub
    PBS_O_MAIL the MAIL environment variable of the submitter
    PBS_JOBID the PBS job ID assigned to the job

  9. Quick start

    Here are the steps to obtain interactive time from PBS. Suppose you want 64 nodes for one hour during default time.
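
    Based on the interactive example in section 4, a single command suffices:

    qsub -I -l walltime=1:00:00,size=64 -q default

    Wait for the ready message from qsub; a shell prompt then appears, from which you can run yod.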

    And here are the steps to submit a job script to PBS. The script requires 64 nodes for one hour. It is submitted to the default queue. For a dedicated time request, submit to the dedicated queue.
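
    Following the earlier examples, and assuming your script is named myscript:

    qsub -l walltime=1:00:00,size=64 -q default myscript

    PBS displays the job ID; run qstat to follow the job's progress.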


Send comments about this document to lafisk@sandia.gov