Network Queueing System (NQS) Usage Notes 2/1/95

Acoma now supports the use of NQS for submitting batch jobs. The Paragon will be dedicated to NQS jobs during off-prime hours: nights and weekends. This is the required method of running jobs overnight. The daytime usage model is unchanged. Dedicated time will still be scheduled for exceptional cases, but the batch system should all but eliminate the need for it.

Jobs are submitted to the NQS system using the qsub command. At present, you must be logged in to acoma when you execute the qsub command. You can check on jobs by using the qstat command. Manual pages for these commands are on the Paragon and cross-development platforms. More detailed NQS information is available in the Paragon Network Queueing System Manual, which is available on jemez, pajarito, and siesta in "/usr/iparagon/current/paragon/ps.docs/pnqsman.ps".

Each user may submit jobs to four queues: "background", "background.day", and two additional queues. The additional queues are assigned when you receive a user-id on the Paragon. Sandia users are assigned "sandia" and "sandia.day". External users are assigned "default" and "default.day". NCHPC users are assigned "nchpc" and "nchpc.day". Intel is assigned "intel" and "intel.day".
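
The naming scheme above can be sketched as a small shell helper. The function name nqs_queue and its "day"/"night" arguments are hypothetical and not part of NQS; the helper only builds the queue name string for a group.

```sh
#!/bin/sh
# Hypothetical helper (not part of NQS): build a queue name from a group
# ("sandia", "nchpc", "default", "intel", "background") and a period
# ("day" for prime time, "night" for off-prime hours).
nqs_queue() {
    group=$1
    period=$2
    case $group in
        sandia|nchpc|default|intel|background) ;;
        *) echo "unknown group: $group" >&2; return 1 ;;
    esac
    case $period in
        day)   echo "$group.day" ;;
        night) echo "$group" ;;
        *) echo "unknown period: $period" >&2; return 1 ;;
    esac
}

nqs_queue sandia day         # prints sandia.day
nqs_queue background night   # prints background
```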

Each user is allowed a maximum of three jobs at a time in any single queue. Running jobs are not counted in determining the allowed number of jobs. The limit is enforced by a script which is usually run automatically once per day, Monday through Friday during normal Sandia work weeks. The script may be run more often if necessary in order to manage the queues.

An NQS job consists of a short script, which contains commands to launch your job. The script may also contain general OSF commands which run on the service partition. For example, a simple application is usually started on 16 nodes using this yod command:

yod -sz 16 hello.world

The NQS script ("doit") to launch the same application is as follows:

#! /bin/sh
date
cd sunmos-src/hello
yod hello.world

The NQS system starts all jobs in your home directory, so you may need to issue a cd command from within your NQS script to change to the proper directory. This script may be submitted to NQS to run on 16 nodes using this command:

qsub -q sandia -lP 16 doit

Instead of including the specific directory from which to run your job in your script, you may use the environment variable QSUB_WORKDIR, which stores the directory from which the job was submitted to NQS. Using the following "doit" script,

#! /bin/sh
date
cd $QSUB_WORKDIR
yod hello.world

this job can be run with the commands

cd sunmos-src/hello
qsub -q sandia -lP 16 doit

NQS will start the job from sunmos-src/hello. Once the job has been submitted to NQS, you may change directories and continue with your work.
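
A slightly more defensive way to pick the job directory is sketched below. The helper name job_workdir is hypothetical. It assumes only what is stated above: QSUB_WORKDIR holds the submission directory, and NQS otherwise starts the job in your home directory.

```sh
#!/bin/sh
# Hypothetical helper: choose the directory a job script should run in.
# Uses QSUB_WORKDIR when it is set and non-empty, otherwise falls back to
# $HOME, which is where NQS starts the job anyway.
job_workdir() {
    echo "${QSUB_WORKDIR:-$HOME}"
}

# In a job script this could be used as:
#   cd "$(job_workdir)" || exit 1
#   yod hello.world
```

Exiting when the cd fails keeps the application from being launched in the wrong directory.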

Select the number of nodes for your job with the "-lP" command line option to qsub; otherwise the system defaults to 1824 nodes. The yod command determines the correct number of nodes from the NQS submission, so the "-sz" option is superfluous. The date command is an example of a general OSF command included in the batch script.

Some additional qsub examples are

qsub -q sandia -lP 1824 doit               run on all nodes
qsub -q sandia.day -lP 22 doit             run on 22 nodes
qsub -q sandia -lP 1024 -lT 2:00:00 doit   run on 1024 nodes for 2 hours max
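
As a rough sketch, the command lines above can be assembled by a small wrapper that prints the qsub command instead of running it, so the options can be checked before submission. The function name nqs_cmd and its argument order are hypothetical; only the -q, -lP and -lT options described in these notes are used.

```sh
#!/bin/sh
# Hypothetical wrapper: print the qsub command line for a job rather than
# executing it. Arguments: queue, node count, script, and an optional
# time limit in H:MM:SS form.
nqs_cmd() {
    queue=$1
    nodes=$2
    script=$3
    limit=${4:-}
    if [ -n "$limit" ]; then
        echo "qsub -q $queue -lP $nodes -lT $limit $script"
    else
        echo "qsub -q $queue -lP $nodes $script"
    fi
}

nqs_cmd sandia 1024 doit 2:00:00   # prints: qsub -q sandia -lP 1024 -lT 2:00:00 doit
nqs_cmd sandia.day 22 doit         # prints: qsub -q sandia.day -lP 22 doit
```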

An NQS job is not restarted if the system crashes or NQS is shut down while the job is executing. We are hoping to provide this capability in the future.

Jobs that are queued (but not yet executing) are preserved across system crashes and will run when the system returns to service.

Output from the script appears in two files: one for standard output and one for standard error. For the example script above, the files are named "doit.oXX" and "doit.eXX", where XX is the job number returned by NQS. These files appear in the directory from which the qsub command was executed.
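
The naming rule can be illustrated with a short helper. The function name nqs_outputs is hypothetical, and the job number 42 is a made-up value for the example; the real number is the one NQS returns at submission.

```sh
#!/bin/sh
# Hypothetical helper: print the standard output and standard error file
# names for a script name and job number, following the rule above.
nqs_outputs() {
    script=$1
    job=$2
    echo "$script.o$job"
    echo "$script.e$job"
}

nqs_outputs doit 42   # prints doit.o42, then doit.e42
```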

CHANGES TO NQS QUEUES:

NQS is currently configured so that a job will not start if it requests more time than remains before the nonprime-to-prime switchover (at 08:00 MST). This keeps large jobs from interfering with prime-time usage. It also means these jobs no longer have to be killed at the switchover, which forced the user to resubmit them at a lower priority in the queue.
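
The arithmetic behind this rule can be sketched in shell. This is not how NQS itself makes the decision; the function names are hypothetical. The sketch assumes the start time falls in an off-prime period and that the next 08:00 is a switchover, so weekends are ignored.

```sh
#!/bin/sh
# Strip one leading zero so values such as "08" are not read as octal.
strip0() {
    n=${1#0}
    echo "${n:-0}"
}

# Hypothetical check: would a job requesting LIMIT (H:MM:SS) finish before
# the next 08:00 if it started at NOW (HH:MM, 24-hour clock)?
# Returns 0 (success) if it fits, 1 otherwise.
fits_before_switchover() {
    now=$1
    limit=$2
    now_h=$(strip0 "${now%%:*}")
    now_m=$(strip0 "${now##*:}")
    rest=${limit#*:}
    lim_h=$(strip0 "${limit%%:*}")
    lim_m=$(strip0 "${rest%%:*}")
    lim_s=$(strip0 "${rest##*:}")

    # Minutes from NOW until the next 08:00.
    remaining=$(( (8 * 60 - (now_h * 60 + now_m) + 1440) % 1440 ))
    # Requested minutes, rounding any leftover seconds up to a full minute.
    needed=$(( lim_h * 60 + lim_m + (lim_s > 0) ))

    [ "$needed" -le "$remaining" ]
}

if fits_before_switchover 02:30 7:00:00; then
    echo "fits"
else
    echo "does not fit"   # 7 hours requested, 5.5 hours remain
fi
```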

The most visible change is the addition of various queues with access limited to specific groups of users. Every user has access to the background queues, "background" and "background.day". These low-priority queues were set up to use spare cycles on the machine, should they become available.

Here is a summary of the available queues:

Queue Name        Availability       Time Limit    Maximum No. 
                                     (hours)       of Nodes
--------------   -----------------   -----------   -----------
sandia.day        8:00 - 17:00 MST    1.25         1024
nchpc.day         8:00 - 17:00 MST    1            1024
default.day       8:00 - 17:00 MST    1            1024
intel.day         8:00 - 17:00 MST    1            1024
background.day    8:00 - 17:00 MST    1            1024

sandia           17:00 -  8:00 MST    7            1824
nchpc            17:00 -  8:00 MST    2            1824
default          17:00 -  8:00 MST    2            1824
intel            17:00 -  8:00 MST    2            1824
background       17:00 -  8:00 MST    2            1824
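
The limits in the table can be encoded as a lookup for checking a request before it is submitted. This is only a sketch: the function names nqs_limits and nqs_request_ok are hypothetical, and the values are copied from the table above with time limits converted to minutes (1.25 hours = 75 minutes).

```sh
#!/bin/sh
# Hypothetical lookup of the table above.
# Prints "<time limit in minutes> <maximum nodes>" for a queue name.
nqs_limits() {
    case $1 in
        sandia.day) echo "75 1024" ;;
        nchpc.day|default.day|intel.day|background.day) echo "60 1024" ;;
        sandia) echo "420 1824" ;;
        nchpc|default|intel|background) echo "120 1824" ;;
        *) echo "unknown queue: $1" >&2; return 1 ;;
    esac
}

# Hypothetical check: succeed only if a request fits the queue's limits.
# Arguments: queue, node count, requested time in minutes.
nqs_request_ok() {
    limits=$(nqs_limits "$1") || return 1
    max_minutes=${limits% *}
    max_nodes=${limits#* }
    [ "$2" -le "$max_nodes" ] && [ "$3" -le "$max_minutes" ]
}

nqs_limits sandia.day   # prints: 75 1024
```

For example, "nqs_request_ok sandia.day 1824 60" fails because the daytime queues allow at most 1024 nodes.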

NQS USAGE TIPS:

Jobs with finite time limits are run before jobs with unlimited time limits. You will get better turnaround time on your jobs by providing accurate run time estimates using the -lT parameter to qsub.

The -v option to qstat shows the number of nodes associated with each request. Its output requires a wide screen (132 columns) to display cleanly.

A FINAL NOTE:

Thank you for your patience while we converge on an NQS solution that performs well for our workload. As usual, submit problems and suggestions to iparagon@cs.sandia.gov.