
Building and installing PBS on Cplant

The README file in the source code repository [3] explains how to build and install PBS on a new machine. It is copied here for convenience.

Introduction:
-------------
This directory contains

  o cplantPatch      - a patch file of Cplant changes for PBS
  o obj              - a directory containing two scripts to configure a
                       directory for building and installing PBS
  o configs.onequeue - a directory containing runtime configuration files
                       for PBS and a file of commands for setting the
                       initial parameters of the running PBS server.

The patches were created for version 2.2, patch level 8 of PBS.  The PBS
source is available for free at www.openpbs.com.

There is a good chance the patches will also work on a later version of PBS.

Building PBS for Cplant:
------------------------
1. Obtain the Open PBS source from www.openpbs.com.  Tar up that source and
   the directory containing this README file and copy it to the 
   file system where you are going to build and install PBS.  (We'll refer 
   to the directory into which you install these files as PBSTOP.) 
   The binaries will be executed from the Cplant service nodes, so
   the file system should be one that is mounted on the service nodes
   or one that gets distributed to the sss0s.  We install our PBS
   executables in /cplant/nfs-root-alpha/sbin and /cplant/nfs-root-alpha/bin
   on the sss1.  We build in /usr/local/pbs on the sss1.  But you could 
   also build and install on a file server that is mounted by the service 
   nodes.

   Note that the stat structure changed sometime between glibc 2.0.7 and
   glibc 2.1.2, so a PBS server, scheduler or MOM built on one will report 
   errors if run on the other.  (Calls to chk_file_sec() will fail.)
   Beware of this if you have different Linux distributions running on
   service nodes and admin or file server nodes.

2. Uncompress and untar the PBS distribution.
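
   For example (the tarball name shown is a placeholder; substitute the
   name of the distribution you actually downloaded):

      gunzip -c pbs_v2.2p8.tar.gz | tar xf -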

3. Move the directory created by this tar file to a directory called
   "patched".  (Or don't, but we'll call it "patched" from now on.)
   If you believe you may want to modify PBS and subsequently build
   a new Cplant patch file, then copy the entire "patched" directory 
   to a directory named "origsrc".  Later on you can create a new 
   patch file by diff'ing origsrc and patched.

   The "patched" directory contains a copy of the PBS adminstrator's
   guide.  You may also want to obtain the External Reference Specification
   for more detailed information about PBS.  This can be built from this
   directory, or obtained from http://mrj.pbs.com.

4. Move the cplantPatch to PBSTOP/patched, and apply it in that directory 
   with this command:

   "patch -N -p1 -l < cplantPatch"

   If you want to know what all these patches do, there's a description
   toward the end of this README.

5. PBSTOP/patched contains all the source files required to build PBS.
   PBSTOP/obj is the target directory that will contain the object files.
   In PBSTOP/obj edit the scripts configBuild and configInstall to have
   the correct pathnames for your build and installation of PBS
   executables and runtime files.  See the comments in each file
   for the meaning of its values.

   The pathnames in configBuild (except for SRCDIR) are compiled into the 
   code, and so should represent the path name at execution time (from the 
   service nodes).  The pathnames in configInstall are used when copying 
   the executables to their home, and so should represent the pathname on 
   the build machine.

   Also ensure that the directory defined with the "prefix" argument 
   exists and is owned by root. 
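
   As a rough sketch (the paths and exact configure options shown here are
   site-specific assumptions; check the comments in configBuild itself for
   the options your version of the script uses), configBuild might boil
   down to something like this:

      #!/bin/sh
      # SRCDIR is the patched PBS source tree (PBSTOP/patched); the prefix
      # and server home are the execution-time paths as seen from the
      # service nodes.
      SRCDIR=/usr/local/pbs/patched
      $SRCDIR/configure --prefix=/cplant/nfs-root-alpha \
                        --set-server-home=/var/spool/pbs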

6. In PBSTOP/obj execute the configBuild script.  PBSTOP/obj is now set 
   up to be the target directory.  Type "make" to build the code.

7. In PBSTOP/obj execute the configInstall script.  PBSTOP/obj is now set 
   up to copy the compiled PBS codes to the proper location and to build
   a template of the runtime working directory.  Type "make install".

8. The PBS code is built and installed.  Now it is necessary to set up
    the runtime configuration files, and start PBS.

Setting up runtime configuration files:
--------------------------------------

The PBS server, scheduler, MOMs and clients access a runtime directory
for configuration information and logging.  The install process created
a template for this directory.  (It's at the pathname specified by
"set-server-home" in configInstall.)  The configuration files in this
template must be updated with Cplant hostnames and scheduling policy 
configuration parameters, and copied into place on the service nodes.  
Cplant specific configuration files are in PBSTOP/configs.onequeue.  
(This is for the single queue policy.)

I should write a script to do this.  Until then, follow these steps.

All hostnames in the working directory files should be the hostnames to
be used by PBS codes on service nodes when they establish socket connections.
These hostnames will be used by the PBS clients and daemons to establish 
connections with each other and to authenticate incoming requests.

1. Copy the MOMs' config file from PBSTOP/configs.onequeue/mom_priv/config 
to the working directory template.  Set these values in the file:

$clienthost - Host names of nodes on which PBS server, scheduler and other
MOMs are running.

$usecp - Directs the MOM to use "cp" instead of "rcp" when copying files
back to users' directories.  List file systems that are shared by
the service nodes on which the MOMs are running.
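
A minimal sketch of a mom_priv/config, assuming the server and scheduler run
on a node named "service-0", another MOM runs on "service-1", and /home is
shared by all the service nodes (all of these names and paths are
placeholders; older PBS versions may also require an explicit hostname in
place of the "*" in $usecp):

   $clienthost service-0
   $clienthost service-1
   $usecp *:/home /home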

2. Copy the scheduler's configuration files from
PBSTOP/configs.onequeue/sched_priv to the sched_priv directory of the
working directory template.  Edit them as follows:

sched_priv/resource_group - This file should list every PBS user, followed
by a unique ID, followed by a parent group, followed by a priority (see the
example after this list).  You can use the user ID from the passwd file for
the unique ID, use "root" for everybody's parent group, and 10 for
everybody's priority.  If all users have the same priority, you may leave
the resource_group file empty.

sched_priv/dedicated_time - see the file for how to set up dedicated time
periods.  If you don't have dedicated time, you still need the file.

sched_priv/holidays - lists the prime and non prime time periods, and all
holidays for the year.  You need the prime and non prime periods even if
you are running the "one queue" scheduler, because the scheduler sorts the
jobs in the queue differently during prime and non prime time.

sched_priv/sched_config - lists parameters of the scheduling policy.  You
may be able to copy this straight into the working directory without any
changes.

sched_priv/usage - you need a zero length usage file with permission 644.
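
For example, a resource_group giving two users (placeholder names and UIDs)
the same priority might look like this:

   jdoe     1001    root    10
   asmith   1002    root    10

and the empty usage file can be created with:

   touch sched_priv/usage ; chmod 644 sched_priv/usage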

3. Copy the server's "nodes" file from PBSTOP/configs.onequeue/server_priv
to the server_priv directory of the working directory template.  You need 
one line for each node on which a PBS MOM daemon will run.  Each line should 
be appended with ":ts" to tell the server it is timeshared.  Do not put 
any blank lines or comments in this file.
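
For example, with MOMs running on three service nodes (placeholder names):

   service-0:ts
   service-1:ts
   service-2:ts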

4. The PBSserver file should contain the hostname of the node on which
the PBS server is running.
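
For example, if the server runs on "service-0" (a placeholder name):

   echo service-0 > PBSserver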

5.  Decide where you want the runtime configuration files to go.  (In fact,
you already decided: it is the set-server-home value in configBuild.)
It is recommended that each service node have a local working directory.
If you can't do this, be certain you built with the CPLANT_REMOTE_WORKING_DIR
switch.

6.  Now copy the working directory to each of the service nodes on which
PBS clients, MOMs, the scheduler or the server will run.  This
would be to the location named by "set-server-home" in configBuild.  Use
the -p option in your copy to ensure the permissions are maintained.
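
For example (the template path, destination path and node names are
placeholders; use whatever remote copy mechanism works at your site):

   for node in service-0 service-1 service-2 ; do
       rcp -rp /usr/local/pbs/spool ${node}:/var/spool/pbs
   done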

Running the PBS server for the first time:
------------------------------------------

The first time you run the PBS server, you must run it with the
argument "-t create", which creates a new database for the server
in the working directory.  Then you need to run "qmgr" and set
several operating parameters.  These parameters are listed in the
file PBSTOP/configs.onequeue/qmgr-setup.  See the PBS External
Reference Specification for an explanation of the values in
the file.
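
A minimal sketch of that first start, assuming qmgr-setup can be fed to
qmgr on its standard input:

   pbs_server -t create
   qmgr < PBSTOP/configs.onequeue/qmgr-setup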

It is important to set the default queue's resources_max.size
attribute.  This attribute defines the limit on the size of jobs in
this queue.  If users attempt to submit a job that requests more
nodes than this size, "qsub" will display an error and suggest
the possibility that their job belongs in a dedicated time queue.
(Use qmgr command "set queue default resources_max.size = 75".)
You may want to change this value from time to time as the size
of your compute partition changes.

You must be root on the node the server is running on to enter
the first qmgr commands.  There are qmgr commands that set up other
users as PBS managers, and once those are entered, these other users 
and users on other hosts may enter qmgr commands.
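
For example, to make a hypothetical user "pbsadmin" on the server host
(placeholder name "service-0") a PBS manager:

   qmgr -c "set server managers = pbsadmin@service-0"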

The PBS server verifies the incoming qmgr configuration requests
by checking the hostname of the source of the command.  If the primary
name on the interface is not the name in the PBSserver file, you will
get an "Unauthorized Request" error when you attempt to configure
the server with qmgr.  By looking at the server's log file in the
server_logs directory, you can see what hostname the server resolved
the incoming request to.
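
For example, assuming the working directory is /var/spool/pbs and the
usual date-named PBS log files:

   tail /var/spool/pbs/server_logs/20010625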

Making a patch file:
-------------------

If you change the PBS source you need to make a new patch file.  The script
below creates the patch file when the 2.2 PBS distribution is in "origsrc"
and the Cplant version is in "patched".  (See step 3 above in "Building PBS".)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  
#!/bin/bash
# Diff the original source against the patched tree to produce a new
# cplantPatch.  Setting TZ and LC_ALL keeps the diff headers and sort
# order consistent from run to run.
export TZ=UTC0
export LC_ALL=C
echo "diff -Naurw origsrc patched > cplantPatch"
diff -Naurw origsrc patched > cplantPatch
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  

The patches
------------

The Cplant patches to PBS 2.2, patch level 8, accomplish the following:

1. We added a "size" resource to PBS.  It's analogous to the yod
   "-size" argument.  It represents the number of compute nodes
   requested by the job.  PBS does not run on our compute nodes.
   They are an abstract resource managed by PBS with help from
   the Cplant runtime system.  The MOM puts the "size" in the
   PBS job's environment so yod can pass it along to the cplant
   node allocator.  (The "nodes" resource in PBS represents the
   service nodes on which the job scripts run.)

   (server/resc_def_all.c, qstat.c, resmom/start_exec.c, fifo/check.c,
   fifo/fairshare.c, fifo/globals.c, fifo/server_info.c, fifo/sort.c
   fifo/sort.h)

2. Changes to qstat display the "size" and "walltime" requests 
   of a job and the total compute nodes in use by jobs.  We also
   display the amount of time the job has been in the queue. 
   (qstat.c)


3. MOM, scheduler and clients use our gethostbypeer() instead
   of gethostname() to determine their host name on the network
   connecting them to the other service nodes.  (With multiple
   network interfaces, gethostname() may return the wrong name.)
   gethostbypeer() uses "ping -R pbs-server-host-name" to find its
   host name on this network.

   (qsub.c, net_connect.h, prepare_path.c, Libnet/Makefile.in,
   Libnet/gethostbypeer.c, mom_main.c, pbs_sched.c)  

4. PBS server gets its hostname from the file in the runtime working
   directory that lists the hostname of the server, instead of from
   gethostname().  (pbsd_main.c)

5. We skip the check of the full path name starting at root
   when verifying that runtime files and directories have the
   correct permissions.  (Liblog/chk_file_sec.c)

6. Log file entries contain the physical node number of the reporter.
   (Liblog/pbs_log.c, pbsd_main.c, mom_main.c, pbs_sched.c)

7. If the compile time flag CPLANT_SCHEDULER_PING_NODES is set, the
   scheduler checks that a node is alive before trying to talk_with_mom.
   (If a node crashed after the server last queried it, the scheduler
   will hang trying to talk to it.)  This time-consuming check is far
   less likely to be needed if CPLANT_NONBLOCKING_CONNECTIONS is on.

   (Libnet/node_is_alive.c, Libnet/Makefile.in, node_info.c)

8. MOM uses the "$cplant_grace" directive in its config file to determine
   how long to wait before killing overlimit jobs.  If the directive's
   value is negative, overlimit jobs are not killed.  MOM only kills
   overlimit jobs if there are jobs in the queues waiting for nodes.
   (The scheduler notifies MOMs after each scheduling cycle whether or
   not there are jobs waiting.)

   MOMs kill all parallel apps started by a PBS job when the PBS job
   terminates.  The MOM simply invokes the Cplant utility "pingd".

   MOM adds the "size" resource request to PBS job's environment as PBS_NNODES.

   (catch_child.c, linux/mom_mach.c, mom_main.c)

9. PBS codes only try to lock files in the runtime working directory if
   CPLANT_REMOTE_WORKING_DIR is undefined.  (mom_main.c, pbs_sched.c,
   pbsd_main.c)

10. If CPLANT_PRIME_NONPRIME_POLICY is defined, the scheduler only runs
   jobs from the "prime" queue during prime time, refusing to start jobs
   that will run past the end of prime time.  No user is permitted to have
   running jobs that requested more than N/2 node hours total, where N is
   the number of compute nodes in the machine.  During nonprime time, the
   scheduler runs jobs from both the prime and nonprime queue, with higher
   priority given to jobs in the nonprime queue.  (Queues need to be set up
   with priorities this way.)  Nonprime jobs will not be started if they
   would extend into prime time.  If this flag is not defined, the
   scheduler ignores prime/nonprime boundaries when scheduling jobs whose
   walltime exceeds the boundary, and the scheduler runs jobs from all
   queues.  (So use sched_config values and queue settings to get desired
   policy.)

   If CPLANT_RUN_ON_SUBMITTING_NODE is defined, the scheduler will try to
   run the job on the node from which the job was submitted, otherwise it
   will run it on the least loaded service node.

   We added two sorts to sort in increasing and decreasing order by the
   job's "size" resource request.

   The scheduler sends a message to all MOMs at the end of each scheduling
   cycle to inform them of whether or not there are jobs in the queues that
   are waiting to run but can't because of a lack of compute nodes.  We
   hacked the Libnet/rm.c facility to do this.

   (check.c, config.h, constant.h, data_types.h, fairshare.c, globals.c,
   fifo.c, globals.h, job_info.c, node_info.c, node_info.h, prime.c,
   queue_info.c, server_info.c, sort.h, sort.c, node_manager.c, Libnet/rm.c)

11.  If the scheduler receives a SIGHUP, it will re-read all of its
   configuration files.  (It will do all the actions it does on receiving
   a SCH_CONFIGURE command.)  (pbs_sched.c)

12.  If qsub rejects a request because of a resource request that is not
   valid for the queue, information is displayed about how to find the
   resource limits on the default and dedicated time queues. (qsub.c)

13.  If CPLANT_NONBLOCKING_CONNECTIONS is defined at compile time, the
   server will open non-blocking sockets to talk to MOMs and the scheduler
   and will time out and log a warning if it can't reach the process.  (The
   MOMs also use this code to talk to other MOMs.)  The server will check
   a node's status before attempting a connection.  (pbs_nodes.h,
   Libnet/net_client.c, server/node_func.c, server/run_sched.c,
   server/svr_connect.c)

14.  When MOM is started with the "-p" option, it will add the jobs that
   were already running on the node when it started to its list of jobs
   to poll.  (resmom/catch_child.c)

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  

The compile switches
--------------------

CPLANT_SERVICE_NODE - general Cplant patches for the "size" resource,
   use of gethostbypeer, the scheduler's interface to MOMs for notification
   of waiting jobs, MOMs' killing of jobs that go over their walltime
   request if there are jobs waiting in the queue, and MOMs' killing of
   parallel applications when the PBS job completes


CPLANT_PRIME_NONPRIME_POLICY - Define this and rebuild the scheduler
   to turn on Cplant's prime/non-prime policy.  During prime time,
   the scheduler will only run jobs from the "prime" queue.  It
   will not allow any user to have running jobs whose node-hour
   total exceeds N/2, where N is the number of nodes available in
   the machine.  It will not start jobs from the "prime" queue
   whose walltime request would extend past the prime time period.
   During non-prime time, the scheduler will run jobs from the
   prime and non-prime queues, with higher priority going to jobs
   in the non-prime queue.  Jobs in the non-prime queue will not
   be started if their walltime request would run past the end
   of non-prime time.

   If this is not turned on, all queues are evaluated during prime
   and nonprime time.  The settings in the scheduler sched_config
   file determine the priority given to the jobs in the queues
   during prime and nonprime time.  In this case the scheduler ignores
   prime/nonprime time boundaries.

CPLANT_RUN_ON_SUBMITTING_HOST - If this is defined, the scheduler
   will schedule each job to run on the service node from which
   it was submitted.  If the MOM is not running on that node, it
   will choose another node.  If this is not defined, the scheduler
   load balances PBS jobs across the service nodes.

CPLANT_REMOTE_WORKING_DIR - define this if the working directory
  is NFS mounted from a remote server.  If it's defined we won't
  try to lock files with fcntl(), which often fails.  It is much
  preferable to place the working directory locally.

CPLANT_NONBLOCKING_CONNECTIONS - lib/Libnet/net_client.c will open
   non-blocking sockets, so the server and MOM don't hang when connecting
   to a MOM that's exited or a MOM on a node that's crashed.  Also, the
   server will check the state of a node (in the server's data structure)
   before attempting to connect.  The server will log a prominent warning
   if a node is unreachable.


