Cplant Fault Recovery Patch for PBS
October, 2001
Lee Ann Fisk, lafisk@sandia.gov
(contact: Jeanette Johnston, jjohnst@sandia.gov)
What this is:
------------
The patch file that creates PBS for Cplant is quite large. The
cplantFRpatch file in this directory is a subset of that patch
file. It contains fault recovery code only. It would be
applicable to non-Cplant sites as well as Cplant sites.
How to patch PBS:
-----------------
The patch file was built from Open PBS v 2.2, patch level 8. It
will probably patch later revisions without trouble. If not, the
patches are simple and you should be able to read the patch file
and patch your code manually.
[7/9/01 - I downloaded Open PBS 2.3.12 and it patched
and compiled without error - lafisk]
We build PBS on Linux/Alpha machines. We have thousands, running
everything from Red Hat 5.1 to Red Hat 6.1. The patches just use
libc functions and will most likely build and run with the desired
result on other systems as well.
To patch the PBS source, cd to the top of your PBS source tree
(where "src" and "doc" and "configure" are) and (assuming the
patch file is here too) :
patch -N -p1 -l < cplantFRpatch
(I'm using "patch" version 2.5, Larry Wall, Free Software Foundation.)
The new code is ifdef'd out. You need to define CPLANT_SERVICE_NODE
and CPLANT_NONBLOCKING_CONNECTIONS to get the patches included when
you compile. The two problems solved by these two enhancements are
described below.
Problem 1:
----------
The first problem is that every scheduling cycle, the server sends a
list of MOMs to the scheduler (we use the FIFO scheduler). The scheduler
tries to contact each MOM to get resource information so it can make
an intelligent scheduling decision. If the MOM or the MOM's node
is no longer talking, the scheduler hangs for three minutes (or
whatever number of seconds it's "-a" argument specified) and then
takes an alarm and exits.
The patches ifdef'd with CPLANT_SERVICE_NODE make it far less likely
that the server will hand the MOM a bad node. It can still happen,
but the window between when the server tests the state of the MOM
node and when the server hands the scheduler a list of MOMs is greatly
reduced.
This message I sent to the PBS users list explains the details:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
From lafisk Thu Apr 27 09:13:43 2000
Subject: Re: [PBS-USERS] machine crash cause PBS to cease op
To: tim.leight@evsx.com (Timothy S. Leight)
Date: Thu, 27 Apr 2000 09:13:43 -0600 (MDT)
Cc: berend@growthnetworks.com (Berend Ozceri),
hender@pbspro.com (Bob Henderson), pbs-users@pbspro.com ('PBS Users')
In-Reply-To: <3908474B.88DEF804@evsx.com> from "Timothy S. Leight" at Apr 27, 2000 01:57
:31 PM
X-Mailer: ELM [version 2.5 PL2]
Content-Length: 2693
Status: OR
I greatly reduced the likelihood of the scheduler getting a
bad node from the server with these three changes to ping_nodes()
in server/node_manager.c. (The server normally pings nodes every
5 minutes, and only if they are in an unknown state or some other
routine in the server marked them as needing a ping. And it
doesn't ping nodes it believes are running a job.)
Remove this code:
if (np->nd_state & (INUSE_JOB|INUSE_JOBSHARE)) {
if (!(np->nd_state & INUSE_NEEDS_HELLO_PING))
continue;
}
It causes the server to skip nodes that are running a job.
Replace the NEEDS_HELLO check like this:
#ifdef CPLANT_SERVICE_NODE
/*
** In our environment, nodes are down until proven otherwise
*/
com = IS_HELLO;
np->nd_state |= INUSE_DOWN;
#else
if (np->nd_state & INUSE_NEEDS_HELLO_PING)
com = IS_HELLO;
#endif
The IS_HELLO requires an acknowledgement from the node and the state
of the node is set to DOWN until we hear from it.
And set the ping interval to your taste. We are pinging all nodes
every 2 minutes:
#ifdef CPLANT_SERVICE_NODE
/*
** Let's try a ping every 2 minutes.
*/
i = 120;
#else
i = 300; /* relaxed ping rate for normal run */
#endif
There is still a window of time where a node can crash after the
server has pinged it and before the scheduler is invoked. But this
rarely happens now. I haven't seen a scheduler toolong alarm in
quite a while now.
Also, ping nodes uses datagram sockets so it doesn't hang like
the connections made for qstat.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Problem 2:
----------
The second problem is that the server hangs if it tries to contact
a MOM on a dead node. The solution implemented here is to use non-blocking
sockets and timeout with an error.
This code is ifdef'd CPLANT_NONBLOCKING_SOCKETS.
These are the affected files:
lib/Libnet/net_client.c - In client_to_svr() open non-blocking sockets,
wait 5 seconds for the connection, and return PBS_NET_RC_RETRY
if connection times out.
include/pbs_config.h.in - Redefine read() and write() to check EAGAIN.
pbs_config.h is conveniently included in every file.
server/node_func.c - New function bad_node_warning() writes a
warning to server's log file if MOM or scheduler can't be reached. It
writes no more than once per hour per node. It also uses set_task
to schedule a trip to the ping_nodes function. ping_nodes will
discover the node is down and set the appropriate status fields for
the node. For this to work you need CPLANT_SERVICE_NODE defined
(that's Problem 1) so that ping_nodes will be sure to ping the node.
New function addr_ok() tests if a node is down or OK.
pbs_nodes.h - Add a field in struct pbsnode that notes at what time
the last warning was written to the log file that the node is down.
server/run_sched.c - In contact_sched(), test addr_ok() before contacting
the scheduler. Return EHOSTDOWN if it's not OK. If it is
OK, and if connection to scheduler fails, call bad_node_warning().
NOTE: This is commented out, since the default case is to run the
scheduler on the same node as the server. If you run the scheduler
on a node listed in the server's nodes file, then uncomment this
check before compiling to prevent server hangs when contacting
the scheduler on a dead node.
server/svr_connect.c - In svr_connect(), test addr_ok() before contacting
a MOM. Return EHOSTDOWN if !addr_ok(). If socket connection to MOM
fails, call bad_node_warning().
That's all it takes.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Revision history:
3/29/01 - implemented fix to node_func.c sent in by chuck@primaryknowledge.com:
in addr_ok(), test if node state is (INUSE_OFFLINE|INUSE_DELETED) before
accessing nd_addrs[0].
10/25/01 - added a change to conn_qsub so MOM uses a *blocking* socket for
connection to interactive ("qsub -I") PBS job. Otherwise MOM loops in
attempt to to read from interactive shell, thereby hogging the CPU. Sent
in by Gary.Skouson@pnl.gov.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Other notes:
Date: Thu, 17 May 2001 17:57:17 -0400
From: Pete Wyckoff
To: Matt Harrington
Cc: pbs-users@openpbs.com
Subject: Re: [PBS-USERS] What to do when a node goes down?
matt@msg.ucsf.edu said:
> I frequently drop a node and then my whole PBS system is not happy. What
> is the best way to deal with a node which is down? I run pbs 2.2 and 2.3
> on separate groups of machines.
Cplant reliability patch.
http://www.cs.sandia.gov/cplant/doc/pbs/pbs.html
Be warned that it triggers a compiler bug in gcc "2.96" as shipped with
linux redhat 7.0 for x86, at least in our pbs-2.3.12 tree which also has
a few other little patches. I can give you the code which puts the
write_nonblocking... functions into a C file instead of letting the
compiler generate bad code on the static inline. The sane thing to do
would be to upgrade your compiler or distro, however.
-- Pete
=============================================================================
Lee Ann Fisk Phone: 505-844-2059
Scalable Computing Systems Department (9223) FAX: 505-845-7442
Sandia National Labs, Mail Stop 1110 Email: lafisk@mp.sandia.gov
Albuquerque, NM 87185-1110 http://www.cs.sandia.gov/cplant
=============================================================================