Cplant Server Library Reference

This document describes the communications library developed for the utilities, daemons and servers that manage and query the Cplant runtime environment. The library lets processes exchange small control messages and request bulk data transfers. It also provides some group formation and group communication functionality, which permits daemons to form groups temporarily to provide a service, as well as error detection, recovery and reporting for point-to-point and group communication.

1. Overview  Library overview
2. Initialization  Initializing and freeing library
3. Control portal functions  Control portals
4. Data portal functions  Data portals
5. Put messages  
6. Get messages  
7. Group formation and communication  Group formation, communication
8. Errors and logging  Error detection, recovery and reporting


1. Overview

This document describes the functions in the Cplant server library, also known as libsrvr.a. These functions provide an interface to the underlying portals layer, allowing processes to communicate point-to-point with portals messages, to form temporary groups, and to do fault-tolerant collective communication.

It is anticipated that Cplant parallel applications launched with yod will use the standard and much fuller featured Cplant MPI library for communication. The server library is used transparently by applications to communicate with IO servers and yod. It is also used by yod and pingd to contact the bebopd node allocator, by the PCT daemons on the compute nodes, and by sundry other services running throughout the system.


2. Initialization

A Cplant application (that is, any code run with yod) that issues server library calls does not need to initialize the server library. This is done in the special Cplant code linked with the application that executes before user code. If you are working with a Cplant application you can skip this chapter. For Cplant servers, utilities, tests, etc. read on.

These other codes must initialize the library explicitly and obtain a portal process ID. The portal process ID is the identifier that uniquely identifies a process that can send and receive portals messages on a node. (For technical reasons we don't use the system's process ID.) In addition, codes that use the collective operations of the server library must make another call to initialize that part of the library.

This section describes the calls that initialize use of the server library.

2.1 register_ppid  Acquire a portal process ID
2.2 server_library_init  Initialize server library structures
2.3 server_coll_init  Initialize collective library structures
2.4 server_library_done  Release server library structures
2.5 example  usage example


2.1 register_ppid

#include "ppid.h"

Function: ppid_type register_ppid(PROCESS_PCB_TYPE *pcb, ppid_type ppid, gid_type gid)

pcb
This is a structure (Process Control Block) set up to communicate process specific information to system code. If your code has been linked with the Cplant startup code, this argument must be the external variable _my_pcb. If not, you must set up your own Process Control Block. See the Cplant startup code source (startup.c) for more information.

ppid
This argument is used to request a particular portal ID (for servers that would like to be accessible at a well known ID), or to request to be assigned a free portal ID. Use the value PPID_AUTO to be assigned a portal process ID by the system. Use of fixed portal IDs is limited. See the `ppid.h' header file for a list of fixed IDs already in use by Cplant servers. Fixed IDs range from 1 to MAX_FIXED_PPID. To request a fixed ID, enter it as the ppid argument. The process must be running as root to request a fixed ID.

gid
Set this variable to assign a group ID to the process. Group IDs are significant in Cplant parallel applications, but are just a number for other portals processes. I believe any number entered here will be fine. At this writing it is stored and never examined. The header file `ppid.h' lists group IDs used by some of the servers.
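
The rules given above for the ppid argument can be summarized in a small check. This is only a sketch and not part of the library: the function name ppid_request_allowed and the numeric values of the DEMO_ constants are assumptions made for illustration (the real PPID_AUTO and MAX_FIXED_PPID are defined in `ppid.h').

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed values for illustration only; the real ones are in ppid.h. */
#define DEMO_PPID_AUTO       0
#define DEMO_MAX_FIXED_PPID  100

/*
** Returns true if a request for portal process ID "ppid" is permitted
** under the rules described above: the auto value is always acceptable,
** and a fixed ID must lie in 1..MAX_FIXED_PPID and be requested by root.
*/
bool ppid_request_allowed(int ppid, bool is_root)
{
    /* The auto value must not collide with the fixed range. */
    assert(DEMO_PPID_AUTO < 1 || DEMO_PPID_AUTO > DEMO_MAX_FIXED_PPID);

    if (ppid == DEMO_PPID_AUTO) {
        return true;            /* the system picks a free ID */
    }
    if (ppid < 1 || ppid > DEMO_MAX_FIXED_PPID) {
        return false;           /* outside the fixed range */
    }
    return is_root;             /* fixed IDs require root */
}
```

A server could run a check like this before calling register_ppid in order to print a clearer diagnostic than a bare registration failure.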


2.2 server_library_init

#include "srvr_comm.h"

Function: int server_library_init(void)

This function initializes the server library structures. It must be called before calling any other server library functions.


2.3 server_coll_init

#include "srvr_coll.h"

Function: int server_coll_init(void)

This function initializes the collective operations (both group membership formation and message passing) of the server library. Call this after calling server_library_init if you intend to use this part of the library.


2.4 server_library_done

#include "srvr_comm.h"

Function: int server_library_done(void)

This function releases the resources used by the server library, including those used by the membership formation and collective operations if they were initialized by a call to server_coll_init.


2.5 example

This is a portion of the test `ts_srvr_coll_comm', which tests the membership formation and collective communication functions of the server library. It is linked with the Cplant startup code in source file startup.c, where the structure named _my_pcb is initialized.

 
#include "ppid.h"
#include "srvr_comm.h"
#include "srvr_coll.h"
#include "srvr_err.h"

main(int argc, char *argv[])
{
char *c;
int rc;

    /*
    **  Set my portal ID - init the server library
    */
    _my_ppid = register_ppid(_my_pcb, PPID_TEST, GID_TEST);

    if (_my_ppid != PPID_TEST){
        log_error("Can not register myself as PPID=%d\n", PPID_TEST);
    }

    if (server_library_init()){
         log_error("Can't init server library");
    }

    /********************************************************
    ** TEST 1
    ** Initialize collective communications of server library
    *********************************************************/

    rc = server_coll_init();

    if (rc != DSRVR_OK){
         coll_lib_error(rc, "server_coll_init");
    }

    /*
    ** Create a group.
    */
    rc = dsrvr_member_init(totnodes, my_rank, GID_TEST);

Then it starts testing. Here is the end of the routine started above.

 
    printf("PASSED!!! - resetting of collective library passed.\n");


    /********************************************************
    ** DONE
    *********************************************************/

    server_library_done();

    exit(0);
}



3. Control portal functions

This section describes functions that create a control portal, send a message to a control portal, poll a control portal, and delete a control portal.

Control portal messages may carry a small amount of information, and/or may initiate a put or get request.

3.1 Control message fundamentals  Use and care of control portals
3.2 control_msg_handle structure  The control message handle and its accessors
3.3 srvr_init_control_ptl  Initializing a control portal
3.4 srvr_init_control_ptl_at  Specifying portal number at initialization
3.5 srvr_release_control_ptl  Freeing a control portal
3.6 srvr_send_to_control_ptl  Sending a message to a control portal
3.7 srvr_get_next_control_msg  Get next message at control portal
3.8 srvr_free_control_msg  Free a control message slot for re-use
3.9 srvr_free_all_control_msgs  Free all control message slots for re-use
3.10 Example  A control portal in action


3.1 Control message fundamentals

A control portal is established for the purpose of receiving small messages from other processes, including requests for transfer of large amounts of data. We call a message arriving at a control portal a control message. The message may carry a user defined message type and a small amount of user data. The maximum size in bytes of the user data buffer is specified by SRVR_USR_DATA_LEN, which is defined in `srvr_comm.h'.

The protocol layer above the server library may use message types to communicate whether an outgoing control message is just a bearer of a small message or whether it is a put or get request. Simple control messages are sent with a call to srvr_send_to_control_ptl. Requests to put a data buffer in the receiver's memory are sent with srvr_comm_put_req. Requests to get a data buffer from the receiver are sent with srvr_comm_get_req.

When an incoming message is accessed by the receiver, the library returns a handle for the message. If the arriving control message is just carrying user data, the handle may be used to access the data and sender information. If the control message is a put or get request, then the receiver calls srvr_comm_put_reply or srvr_comm_get_reply with the handle to process the data transfer request.

To send a control message to a process, you must have three things:

physical node ID
The value of the receiver's global variable _my_pnid, a 32 bit unsigned integer.
portal process ID
The value of the receiver's global variable _my_ppid, a 16 bit unsigned integer.
portal ID
The value returned to the receiver when it called srvr_init_control_ptl. (Alternatively, the receiver could have requested a specific portal ID when establishing the control portal with srvr_init_control_ptl_at.)

It is the library user's responsibility to obtain these values for the sender. Since the server library was written to provide communication for client/server processes, or for mutable groups (groups that form, disband, and reform with different members), a process is given no map at initialization time of the other processes with which it communicates.
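
Because the library supplies no map of peers, a sender typically receives the three values out of band, for instance on its command line. The sketch below shows one way to carry them together. The type demo_ctl_addr, the function demo_parse_ctl_addr and the "nid/ppid/portal" text form are all hypothetical and are not part of the server library.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the three values a sender needs. */
typedef struct {
    uint32_t nid;     /* receiver's physical node ID (_my_pnid)  */
    uint16_t ppid;    /* receiver's portal process ID (_my_ppid) */
    int      portal;  /* receiver's control portal index         */
} demo_ctl_addr;

/*
** Parse "nid/ppid/portal" (for example "12/40/22") into addr.
** Returns 0 on success, -1 if the text is malformed or out of range.
*/
int demo_parse_ctl_addr(const char *text, demo_ctl_addr *addr)
{
    unsigned long nid, ppid;
    int portal;
    char extra;

    assert(text != NULL && addr != NULL);

    /* Exactly three fields must convert; a fourth means trailing junk. */
    if (sscanf(text, "%lu/%lu/%d%c", &nid, &ppid, &portal, &extra) != 3) {
        return -1;
    }
    if (nid > UINT32_MAX || ppid > UINT16_MAX || portal < 0) {
        return -1;
    }
    addr->nid = (uint32_t)nid;
    addr->ppid = (uint16_t)ppid;
    addr->portal = portal;
    return 0;
}
```

The parsed fields would then be passed as the nid, pid and portal_id arguments of srvr_send_to_control_ptl.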


3.2 control_msg_handle structure

When a message arrives in a control portal and is accessed by the receiver, a handle is returned. The handle has information about the sender, up to SRVR_USR_DATA_LEN bytes of user data, and information required by the server library for freeing the message or complying with a put or get request.

The handle has type control_msg_handle, and is defined in srvr_comm_ctl.h. SRVR_USR_DATA_LEN is defined in srvr_comm.h.

The actual fields in the handle are implementation dependent. Its fields are accessed with these macros, where h is the handle (not a pointer to the handle):

SRVR_HANDLE_NID(h)
The physical node ID of the sender.
SRVR_HANDLE_PID(h)
The portal process ID of the sender.
SRVR_HANDLE_TYPE(h)
The message type set by the sender.
SRVR_HANDLE_RET_PTL(h)
The sender's return portal for put or get requests.
SRVR_HANDLE_TRANSFER_LEN(h)
The length in bytes of the data to be transferred by the put or get request.
SRVR_HANDLE_USERDEF(h)
A pointer to the user data.
SRVR_HANDLE_MATCHBITS(h)
The matchbits required for the return operation (the put or get reply), a portals thing you need not worry about.

These two macros are useful in processing handles:

SRVR_CLEAR_HANDLE(h)
Initializes a handle to an unused state.
SRVR_IS_VALID_HANDLE(h)
Tests whether a handle's fields are set to valid values, that is, whether they represent an actual received control message.
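
To illustrate the macro style described above (accessors that take the handle itself rather than a pointer), here is a self-contained mock. The struct layout, the DEMO_ names and the buffer size are invented for this sketch; the real control_msg_handle fields are implementation dependent and should only be reached through the SRVR_ macros.

```c
#include <assert.h>
#include <string.h>

#define DEMO_USR_DATA_LEN 64        /* assumed; see SRVR_USR_DATA_LEN */

/* Hypothetical layout; the real control_msg_handle is opaque. */
typedef struct {
    int  valid;                     /* nonzero once a message is held */
    int  nid;                       /* sender's physical node ID      */
    int  pid;                       /* sender's portal process ID     */
    int  type;                      /* message type set by the sender */
    char userdef[DEMO_USR_DATA_LEN];
} demo_handle;

/* Accessors take the handle itself, not a pointer to it. */
#define DEMO_HANDLE_NID(h)      ((h).nid)
#define DEMO_HANDLE_PID(h)      ((h).pid)
#define DEMO_HANDLE_TYPE(h)     ((h).type)
#define DEMO_HANDLE_USERDEF(h)  ((h).userdef)
#define DEMO_CLEAR_HANDLE(h)    memset(&(h), 0, sizeof(h))
#define DEMO_IS_VALID_HANDLE(h) ((h).valid != 0)

/* Fill a handle as a receive call might, for demonstration only. */
void demo_fill_handle(demo_handle *h, int nid, int pid, int type,
                      const char *data)
{
    assert(h != NULL && data != NULL);
    assert(strlen(data) < DEMO_USR_DATA_LEN);

    h->valid = 1;
    h->nid = nid;
    h->pid = pid;
    h->type = type;
    strcpy(h->userdef, data);
}
```

Because the accessors expand to lvalues in this mock, they can be used both to read fields and to set them, which is how matching criteria are described for the Portals 3 version of srvr_get_next_control_msg.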


3.3 srvr_init_control_ptl

#include "srvr_comm.h"

Function: int srvr_init_control_ptl (int max_num_msgs)

max_num_msgs
The maximum number of messages in the control portal at any point in time.
RETURN
The portal index, or SRVR_INVAL_PTL on failure.

A control portal is required if control messages are to be received, or if requests to put or get data are to be received. This function creates a control portal.

The control portal created will be able to store up to max_num_msgs messages at any point in time. The function returns the index of the portal created. Control message slots can be freed for re-use, so the portal need not have storage for all potential incoming messages.
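
The slot behavior can be pictured with a small model. This is a sketch under stated assumptions, not the library's implementation: the demo_ names are invented, and the model assumes that a message arriving at a full portal is simply not stored, which this document does not confirm.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SLOTS 16

/* Hypothetical model of a control portal's message slots. */
typedef struct {
    int capacity;                  /* max_num_msgs at creation      */
    int in_use[DEMO_MAX_SLOTS];    /* 1 if the slot holds a message */
} demo_ctl_portal;

void demo_portal_init(demo_ctl_portal *p, int max_num_msgs)
{
    int i;

    assert(p != NULL);
    assert(max_num_msgs > 0 && max_num_msgs <= DEMO_MAX_SLOTS);

    p->capacity = max_num_msgs;
    for (i = 0; i < DEMO_MAX_SLOTS; i++) {
        p->in_use[i] = 0;
    }
}

/* A message arrives: returns the slot used, or -1 if the portal is full. */
int demo_portal_arrive(demo_ctl_portal *p)
{
    int i;

    assert(p != NULL);
    for (i = 0; i < p->capacity; i++) {
        if (!p->in_use[i]) {
            p->in_use[i] = 1;
            return i;
        }
    }
    return -1;
}

/* The receiver frees a slot: 0 on success, -1 on a bad or empty slot. */
int demo_portal_free(demo_ctl_portal *p, int slot)
{
    assert(p != NULL);
    if (slot < 0 || slot >= p->capacity || !p->in_use[slot]) {
        return -1;
    }
    p->in_use[slot] = 0;
    return 0;
}
```

The model shows why a receiver that frees messages promptly can choose a max_num_msgs much smaller than the total number of messages it expects over its lifetime.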


3.4 srvr_init_control_ptl_at

#include "srvr_comm.h"

Function: int srvr_init_control_ptl_at (int max_num_msgs, int portal_id)

max_num_msgs
The maximum number of messages in the control portal at any point in time.
portal_id
The portal ID you request this control portal to have.
RETURN
0 (zero) on success, or SRVR_INVAL_PTL on failure.

You may wish to create a control portal at a specified portal index, which is well known to client programs. The range of valid portal indices is given by MINUSERPTL and MAXUSERPTL. The function attempts to create a control portal with the portal index you specify.


3.5 srvr_release_control_ptl

#include "srvr_comm.h"

Function: int srvr_release_control_ptl (int portal_id)

portal_id
The portal index specifying the portal to be freed.
RETURN
0 (zero) on success, or -1 on failure.

This function frees all resources associated with a control portal. Upon successful completion, the portal index may be reused.


3.6 srvr_send_to_control_ptl

#include "srvr_comm.h"

Function: int srvr_send_to_control_ptl (int nid, int pid, int portal_id, int msg_type, char *user_data, int len)

nid
The physical node ID of the receiver.
pid
The portal process ID of the receiver.
portal_id
The portal index of a control portal established by the receiver for messages like this one.
msg_type
An optional tag which the receiver can check.
user_data
Pointer to a buffer containing a small amount of user defined information.
len
The length in bytes of the user_data buffer.
RETURN
0 (zero) on success, or -1 on failure.

This function sends a control message to a remote process. The function returns when the underlying message passing code has finished setting up the send and the user_data buffer (if any) may be reused.


3.7 srvr_get_next_control_msg

#include "srvr_comm.h"

Function: int srvr_get_next_control_msg (int portal_id, control_msg_handle *handle, int *msg_type, int *xfer_len, char **user_data)

portal_id
The portal index to search for a message.
handle
A pointer to a control_msg_handle which (if non-NULL) will be filled in by the function if a new control message is found in the portal.
msg_type
A pointer to a field which (if non-NULL) will be set by the function to the message type of the new control message found in the portal.
xfer_len
A pointer to a field which (if non-NULL) will be set by the function to the length in bytes of the put or get request in the new control message found in the portal.
user_data
A pointer to a field which (if non-NULL) will be set by the function to a pointer to the user_data in the new control message found in the portal.
RETURN
-1 on failure, 0 (zero) if no message is found, or 1 if a new control message was found in the portal.
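
Since the function does not block, callers usually poll it in a loop and act on the three return values. The sketch below wraps that convention in a bounded loop so that a caller cannot spin forever. The names demo_poll_fn, demo_poll_bounded and demo_countdown_poll are hypothetical; in real code the callback would wrap a call to srvr_get_next_control_msg.

```c
#include <assert.h>
#include <stddef.h>

/* A poll callback follows the convention above: -1, 0 or 1. */
typedef int (*demo_poll_fn)(void *ctx);

/*
** Poll up to max_tries times. Returns 1 if a message was found, 0 if
** none arrived within max_tries polls, or -1 as soon as a poll fails.
*/
int demo_poll_bounded(demo_poll_fn poll, void *ctx, int max_tries)
{
    int i, rc;

    assert(poll != NULL && max_tries > 0);

    for (i = 0; i < max_tries; i++) {
        rc = poll(ctx);
        if (rc != 0) {
            return (rc == 1) ? 1 : -1;
        }
    }
    return 0;
}

/*
** Stand-in poller for testing: *ctx is the number of empty polls left
** before a message appears; a negative value simulates an error.
*/
int demo_countdown_poll(void *ctx)
{
    int *remaining = (int *)ctx;

    if (*remaining < 0) return -1;
    if (*remaining == 0) return 1;
    (*remaining)--;
    return 0;
}
```

A production loop would probably also sleep or yield between polls; that is left out here to keep the sketch short.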

This function is used to receive a control message from a control portal. The semantics for the Portals 2 version and the Portals 3 version differ slightly.

In both versions, if the caller provides the address of a control_msg_handle, the function fills it in with sender information, a pointer to the user data (if any), and information about the put or get request (if any). The information in the handle may be accessed by using the macros described at the beginning of this chapter. The caller must retain the handle in order to free the control message later, or to reply to the put or get request it specifies.

In both versions, if msg_type or xfer_len are non-NULL, they are set by the function to the associated value of the received control message. (These values are also accessible by using the accessor macros on the handle.) If user_data is non-NULL, it is set to point to the user defined data sent with the control message. Be aware that this buffer may be overwritten at any time after the control message is freed.

In both versions, if the handle is NULL, the control message is freed by the library before the function returns, since you would not be able to free it later without a handle.

In the Portals 2 version of this call, the function returns the next control message in the portal, in FIFO order. In the Portals 3 version of the library, the handle passed in to srvr_get_next_control_msg can be used to specify matching criteria. If no matching criteria are given, the next control message in the portal is returned.

In the Portals 3 version, it is essential to clear the handle with the SRVR_CLEAR_HANDLE macro before calling srvr_get_next_control_msg. Otherwise, garbage in the handle fields may be interpreted as matching criteria. To specify matching criteria (after clearing the handle), use the SRVR_HANDLE_NID, SRVR_HANDLE_PID, and/or SRVR_HANDLE_TYPE macros described at the beginning of this chapter to specify the sender's physical node ID, sender's portal process ID, and/or message type of the message to be returned.
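
The effect of matching criteria can be modeled as a wildcard search over the queued messages. The sketch below is an illustration only: DEMO_ANY stands in for whatever value a cleared handle field holds, the demo_ names are invented, and the oldest-first search order among matching messages is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_ANY (-1)   /* assumed "no criterion" value of a cleared field */

typedef struct { int nid, pid, type; } demo_msg;

/* A message matches if every criterion is DEMO_ANY or equal to its field. */
static int demo_matches(const demo_msg *crit, const demo_msg *m)
{
    return (crit->nid  == DEMO_ANY || crit->nid  == m->nid)
        && (crit->pid  == DEMO_ANY || crit->pid  == m->pid)
        && (crit->type == DEMO_ANY || crit->type == m->type);
}

/*
** Return the index of the first (oldest) queued message that satisfies
** crit, or -1 if none does.
*/
int demo_next_match(const demo_msg *queue, int n, const demo_msg *crit)
{
    int i;

    assert(queue != NULL && crit != NULL && n >= 0);

    for (i = 0; i < n; i++) {
        if (demo_matches(crit, &queue[i])) {
            return i;
        }
    }
    return -1;
}
```

The model also shows the hazard the text warns about: if the criteria hold leftover values instead of the wildcard, valid messages are silently skipped.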


3.8 srvr_free_control_msg

#include "srvr_comm.h"

Function: int srvr_free_control_msg(int portal_id, control_msg_handle *handle)

portal_id
The index of the control portal in which the control message is stored.
handle
The handle returned by srvr_get_next_control_msg.
RETURN
0 (zero) on success, or -1 on failure.

This function frees a control message slot in a control portal for re-use.


3.9 srvr_free_all_control_msgs

#include "srvr_comm.h"

Function: int srvr_free_all_control_msgs(int portal_id)

portal_id
The index of the control portal.
RETURN
0 (zero) on success, or -1 on failure.

This function essentially resets a control portal so that all its slots for incoming messages are free.


3.10 Example

Below are two sample programs, a receiver and a sender. The receiver sets up a portal to receive control messages and publicizes its physical node ID, portal process ID and control portal index. The sender sends a control message to the receiver.

The overview explained the difference between Cplant parallel applications and stand alone processes that use portals. The example here shows stand alone processes.

The receiver of the control message:

 
#include <stdio.h>
#include <stdlib.h>
#include "srvr_comm.h"

extern nid_type _my_pnid;
extern ppid_type _my_ppid;

int main(void)
{
int portal_id;
int rc;
control_msg_handle handle;
int type;
char *userdef;

    assign_ppid(); /* since I'm not part of a Cplant parallel app */

    server_library_init();

    portal_id = srvr_init_control_ptl(1);

    if (portal_id == SRVR_INVAL_PTL){
        printf("Can't create control portal (%s)\n",
            CPstrerror(CPerrno));
        exit(-1);
    }

    printf("My physical node ID: %d\n",_my_pnid);
    printf("My portal process ID: %d\n",_my_ppid);
    printf("My control portal index: %d\n",portal_id);
    printf("OK to start sender process.\n");

    SRVR_CLEAR_HANDLE(handle);

    while (1){
        rc = srvr_get_next_control_msg(portal_id, &handle,
                 &type, NULL, &userdef);

        if (rc == 1) break;

        if (rc == -1){
            printf("Error polling control portal (%s)\n",
                CPstrerror(CPerrno));
            exit(-1);
        }
    }

    printf("Receiver got a control message.\n");
    printf("Source nid %d, ppid %d, type %d\n",
                  SRVR_HANDLE_NID(handle),
                  SRVR_HANDLE_PID(handle),
                  SRVR_HANDLE_TYPE(handle));  /* also in "type" */

    printf("User data %s\n",SRVR_HANDLE_USERDEF(handle));

    srvr_release_control_ptl(portal_id);

    server_library_done();
}

The sender of the control message:

 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "srvr_comm.h"

extern nid_type _my_pnid;
extern ppid_type _my_ppid;

static char userdefBuf[SRVR_USR_DATA_LEN];

int main(int argc, char *argv[])
{
nid_type nid;
ppid_type pid;
int portal_id;
int rc;

    if (argc < 4){
       printf (
       "Sender needs receiver nid, pid and portal id, in that order\n");
       exit(-1);
    }

    assign_ppid(); /* since I'm not part of a Cplant parallel app */

    server_library_init();

    nid = (nid_type)atoi(argv[1]);
    pid = (ppid_type)atoi(argv[2]);
    portal_id = atoi(argv[3]);

    sprintf(userdefBuf,"Hi from %d/%d\n",_my_pnid,_my_ppid);

    rc = srvr_send_to_control_ptl(nid, pid, portal_id, 999,
                   userdefBuf, strlen(userdefBuf) + 1);

    if (rc == -1){
        printf("Sender: error sending to %d/%d/%d (%s)\n",
                nid, pid, portal_id, CPstrerror(CPerrno));
    }
    else{
        printf("Sender: little message sent to %d/%d/%d\n",
                   nid, pid, portal_id);
    }

    server_library_done();
}


4. Data portal functions

This section describes functions that manipulate data portal buffers. A user's buffer is attached to a data portal when that user issues a put or get request.

The functions described here determine whether a buffer has been read (subsequent to a put request) or written (subsequent to a get request). There is also a function to remove the user's buffer from the data portal when the operation(s) on the buffer have completed.

4.1 Data portal fundamentals  What is a data portal for?
4.2 srvr_delete_buf  Remove a buffer from a data portal
4.3 srvr_test_write_buf  Test if buffer has been written by a remote process
4.4 srvr_test_read_buf  Test if buffer has been read by a remote process


4.1 Data portal fundamentals

When you send a put request to a remote process or processes (with srvr_comm_put_req), you specify a buffer containing the data you wish to send. This buffer is attached to a data portal until you call srvr_delete_buf indicating that you have determined that all remote processes have retrieved the data.

If you send a get request to a remote process (with srvr_comm_get_req), the buffer to contain the incoming data is again attached to a data portal. It remains there until you call srvr_delete_buf after determining the data has arrived.

There are no examples in this section. The examples in the put and get sections illustrate data portal buffer usage.


4.2 srvr_delete_buf

#include "srvr_comm.h"

Function: int srvr_delete_buf(int slot)

slot
The slot number in the data portal of the buffer. This is the handle returned by srvr_comm_put_req. It is also the handle returned by a non-blocking call to srvr_comm_get_req.
RETURN
0 (zero) on success, -1 on failure.

This function frees a slot in the data portal. You are advised to free data portal slots when they are no longer needed.


4.3 srvr_test_write_buf

#include "srvr_comm.h"

Function: int srvr_test_write_buf (int slot)

slot
The slot in the data portal containing the buffer. (This was returned by srvr_comm_get_req.)
RETURN
0 (zero) if the data has not arrived, 1 if the data has arrived, -1 on error.

After issuing a get request to a remote process you may want to know if the remote process has responded by sending the data. This function tests for receipt of the data.


4.4 srvr_test_read_buf

#include "srvr_comm.h"

Function: int srvr_test_read_buf (int slot, int count)

slot
The slot in the data portal containing the buffer. (This was returned by srvr_comm_put_req.)
count
The total number of accesses to the put buffer you were expecting.
RETURN
0 (zero) if fewer than count remote processes have read the data in the buffer, 1 if at least count remote processes have read the data, -1 on error.

After issuing a put request to one or more remote processes, you may want to know when all remote processes have retrieved the data so that you may reuse the data buffer. This function provides that information.

Note that the remote side may be replying to the put request by pulling in the data in fixed-size blocks (using srvr_comm_put_reply_partial). The count in this case is the number of fixed-size blocks read from the buffer.
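
When receivers pull the data in blocks, the count to pass is the total number of block reads rather than the number of receivers. The helper below works that number out. It is a sketch: the name demo_expected_reads is hypothetical, and it assumes that every target reads the entire buffer using the same block size.

```c
#include <assert.h>

/*
** Number of accesses to expect on a put buffer of buflen bytes when
** each of ntargets receivers pulls it in blocks of block_len bytes.
** A receiver that uses a single full reply is the case
** block_len >= buflen, which counts as one access.
*/
int demo_expected_reads(int buflen, int block_len, int ntargets)
{
    int blocks_per_target;

    assert(buflen > 0 && block_len > 0 && ntargets > 0);

    /* Integer ceiling of buflen / block_len. */
    blocks_per_target = (buflen + block_len - 1) / block_len;
    return blocks_per_target * ntargets;
}
```

For instance, a 10000 byte buffer pulled by 3 targets in 4096 byte blocks gives 3 blocks per target, so the originator would wait for a count of 9.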


5. Put messages

A put message in the Cplant server library is a control message sent from a process to one or more remote processes requesting to send a large message to them. Large means any message too large to pack into a control message's user data buffer.
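
The size rule above amounts to a simple decision on the sending side. The sketch below expresses it; the names are hypothetical and the value of DEMO_USR_DATA_LEN is an assumption (the real limit is SRVR_USR_DATA_LEN in `srvr_comm.h').

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_USR_DATA_LEN 64   /* assumed; the real limit is SRVR_USR_DATA_LEN */

typedef enum {
    DEMO_SEND_CONTROL,   /* fits in a control message's user data */
    DEMO_SEND_PUT        /* too large: needs a put request        */
} demo_send_kind;

/* Decide how a payload of len bytes should travel. */
demo_send_kind demo_choose_send(size_t len)
{
    return (len <= DEMO_USR_DATA_LEN) ? DEMO_SEND_CONTROL : DEMO_SEND_PUT;
}

/* Same decision for a NUL-terminated string, counting the terminator. */
demo_send_kind demo_choose_send_str(const char *s)
{
    assert(s != NULL);
    return demo_choose_send(strlen(s) + 1);
}
```

A caller would use srvr_send_to_control_ptl in the first case and srvr_comm_put_req in the second.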

This section describes the functions that initiate a put request and reply to a put request. The function that tests for completion of a put request (srvr_test_read_buf), and the function that releases the slot in the data portal used by the put buffer (srvr_delete_buf) are described in the section on data portals.

5.1 srvr_comm_put_req  Send a request to put data in memory of a remote process
5.2 srvr_comm_put_reply  Reply to a request from a remote process to put data in your memory
5.3 srvr_comm_put_reply_partial  Receive only part of the data from the remote process
5.4 Example  Sample code


5.1 srvr_comm_put_req

#include "srvr_comm.h"

Function: int srvr_comm_put_req (char *buf, int buflen, int msg_type, char *user_data, int datalen, int ntargets, int *nidlist, int *pidlist, int *ptllist)

buf
Pointer to buffer containing put data.
buflen
Length in bytes of put data buffer.
msg_type
An optional message type that is sent with the put request.
user_data
Optional pointer to a buffer containing a small amount of user defined information. The processes receiving the control message can access this data.
datalen
Length in bytes of the user_data buffer, 0 if no user_data.
ntargets
The number of remote processes you wish to come and pick up the put buffer.
nidlist
List of the physical node IDs of the put targets.
pidlist
List of the portal process IDs of the put targets.
ptllist
List of the control portals at which the remote processes are checking for messages like this one.
RETURN
On success, a handle is returned with which to check for completion. On failure, -1 is returned.

This function sends a put request to the control portals of remote processes. It is sent to request that the remote processes pull data from a buffer in our process' address space. It returns when the request has been sent out and the user_data buffer (if specified) may be reused.

Use srvr_test_read_buf to determine when all remote processes have pulled the buffer from the data portal. This function is described in the section on data portals.

On failure the global variable CPerrno will be set to one of these values:

EINVAL
Invalid command line arguments.
EPORTAL
Failure in message passing layer, maybe out of some local resource.
ESENDTIMEOUT
Message send did not complete, maybe one of the target nodes has crashed.


5.2 srvr_comm_put_reply

#include "srvr_comm.h"

Function: int srvr_comm_put_reply (control_msg_handle *mh, void *recv_buf, int buf_len)

mh
The control message handle set by the call to srvr_get_next_control_msg when the put request was received.
recv_buf
The address of a buffer at which the put data may be written.
buf_len
The length in bytes of recv_buf.
RETURN
0 (zero) on success, -1 on failure.

This function is used to reply to a put request from a remote process. (The put request arrived in a control message. The users of the server library determine, most likely by use of message types, that the message is a put request.) The function returns when the data has arrived in recv_buf.

On failure the global variable CPerrno may be set to this value:

ERECVTIMEOUT
The data did not arrive within a reasonable time. Perhaps the process that initiated the put has exited.


5.3 srvr_comm_put_reply_partial

#include "srvr_comm.h"

Function: int srvr_comm_put_reply_partial (control_msg_handle *mh, void *recv_buf, int buf_len, int offset)

mh
The control message handle set by the call to srvr_get_next_control_msg when the put request was received.
recv_buf
The address of a buffer at which the put data may be written.
buf_len
The length in bytes of recv_buf.
offset
An offset into the put buffer on the remote node from which to begin the transfer.
RETURN
0 (zero) on success, -1 on failure.

This function allows the receiver to bring in the put data in blocks. For example, the receiver may make 8 calls to bring in the put data as 8 blocks, incrementing the offset each time to get the next block of the buffer from the remote process. The function returns when the data has been copied to recv_buf. (The offset applies to the buffer at the remote process, not to recv_buf.)

Note that each call counts as an access to the put buffer on the originating side, so srvr_test_read_buf on the originating process will count one access for each call.
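
The block-by-block pattern can be simulated locally to show how the offset advances. In the sketch below, demo_pull_block is a stand-in for one srvr_comm_put_reply_partial call, implemented as a plain memory copy; both function names are hypothetical and no portals traffic is involved.

```c
#include <assert.h>
#include <string.h>

/*
** Stand-in for one partial reply: copy up to buf_len bytes, starting at
** "offset" in the remote put buffer, into recv_buf. Returns the number
** of bytes copied.
*/
static int demo_pull_block(const char *remote, int remote_len,
                           char *recv_buf, int buf_len, int offset)
{
    int n = remote_len - offset;

    if (n > buf_len) n = buf_len;
    if (n <= 0) return 0;
    memcpy(recv_buf, remote + offset, (size_t)n);
    return n;
}

/*
** Bring in a whole put buffer block by block, assembling it in dest.
** Returns the number of pulls made, which is the number of accesses
** the originating side would observe.
*/
int demo_pull_all(const char *remote, int remote_len,
                  char *dest, int block_len)
{
    int offset = 0, pulls = 0;

    assert(remote != NULL && dest != NULL);
    assert(remote_len >= 0 && block_len > 0);

    while (offset < remote_len) {
        /*
        ** The offset argument indexes the remote buffer; dest + offset
        ** is just where this sketch chooses to store the block.
        */
        offset += demo_pull_block(remote, remote_len,
                                  dest + offset, block_len, offset);
        pulls++;
    }
    return pulls;
}
```

With a 20 byte buffer and 8 byte blocks the loop makes three pulls, the last one short, and the originator would need to pass a count of 3 to srvr_test_read_buf.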

On failure the global variable CPerrno may be set to this value:

ERECVTIMEOUT
The data did not arrive within a reasonable time. Perhaps the process that initiated the put has exited.


5.4 Example

The first program below receives a put request and then accepts the data transfer. The second program sends the put request to the first program. They are written as stand alone portals processes, not as a Cplant parallel application. Since they request the same portal process ID, they must be run on different nodes. Since they don't let the system pick their portal process ID, they must be run as root. (See the overview for more explanation.)

 
#include <stdio.h>
#include <stdlib.h>
#include "ppid.h"
#include "srvr_comm.h"

#define EXAMPLE_PUT  0x0101

#define TEST_PORTAL 22

extern nid_type _my_pnid;
extern ppid_type _my_ppid;

main()
{
int portal_id = TEST_PORTAL;
int rc, i;
control_msg_handle handle;
int type, xferLen;
char *recvBuf;

    /*
    **  Set my portal ID - init the server library
    */
    _my_ppid = register_ppid(_my_pcb, PPID_TEST, GID_TEST);

    if (_my_ppid != PPID_TEST){
        log_error("Can not register myself as PPID=%d\n", PPID_TEST);
    }

    if (server_library_init()){
         log_error("Can't init server library");
    }

    /*
    ** Need a control portal to receive the put request.
    */
    rc = srvr_init_control_ptl_at(1, TEST_PORTAL);

    if (rc == SRVR_INVAL_PTL){
        printf("Can't create control portal (%s)\n",
            CPstrerror(CPerrno));
        exit(-1);
    }

    printf("My physical node ID: %d\n",_my_pnid);
    printf("My portal process ID: %d\n",_my_ppid);
    printf("My control portal index: %d\n",portal_id);
    printf("OK to start sender process.\n");

    SRVR_CLEAR_HANDLE(handle);

    while (1){
        rc = srvr_get_next_control_msg(TEST_PORTAL, &handle,
                 &type, &xferLen, NULL);

        if (rc == 1) break;

        if (rc == -1){
            printf("Error polling control portal (%s)\n",
                CPstrerror(CPerrno));
            exit(-1);
        }
    }

    if (type != EXAMPLE_PUT){
       printf("Unexpected message received\n");
       exit(-1);
    }

    printf("Receiver got a Put request from %d/%d for %d bytes.\n",
                  SRVR_HANDLE_NID(handle),
                  SRVR_HANDLE_PID(handle),
                  xferLen);

    recvBuf = (char *)malloc(xferLen);

    if (!recvBuf){
        printf("Receiver: malloc problem\n");
        exit(-1);    
    }

    rc = srvr_comm_put_reply(&handle, recvBuf, xferLen);

    if (rc == -1){
         printf("Receiver: error in put reply (%s)\n",
                      CPstrerror(CPerrno));
    }
    else{
        for (i=0; i<xferLen; i++){
            if (recvBuf[i] != i%8){
                 printf("Receiver: Unexpected put data came in.\n");
                 break;
            }
        }
    }

    free(recvBuf);

    srvr_release_control_ptl(TEST_PORTAL);

    server_library_done();

    return 0;
}

Here is the initiator of the put request:

 
#include <stdio.h>
#include <stdlib.h>
#include "ppid.h"
#include "srvr_comm.h"

#define EXAMPLE_PUT  0x0101
#define TEST_PORTAL  22

extern ppid_type _my_ppid;

char putBuf[10000];

int main(int argc, char *argv[])
{
int destptl;
int destnid;
int destpid;
int rc, i, handle;

    if (argc < 2){
       printf ("Sender needs receiver nid.\n");
       exit(-1);
    }

    /*
    **  Set my portal ID - init the server library
    */
    _my_ppid = register_ppid(_my_pcb, PPID_TEST, GID_TEST);

    if (_my_ppid != PPID_TEST){
        log_error("Can not register myself as PPID=%d\n", PPID_TEST);
    }

    if (server_library_init()){
         log_error("Can't init server library");
    }

    /*
    ** buffer to put in remote process' memory
    */
    for (i=0; i<10000; i++){
        putBuf[i] = i%8;
    }
   
    destnid = atoi(argv[1]);
    destptl = TEST_PORTAL;
    destpid = PPID_TEST;

    handle = srvr_comm_put_req(putBuf, 10000, EXAMPLE_PUT,
                          NULL, 0,
                          1, &destnid, &destpid, &destptl);

    if (handle == -1){
        printf("Sender: error sending put request to %d/%d/%d (%s)\n",
                destnid, destpid, destptl,
                CPstrerror(CPerrno));
        exit(-1); 
    }

    printf("Sender: Put request has been sent to %d/%d/%d\n",
                       destnid, destpid, destptl);

    while ((rc = srvr_test_read_buf(handle, 1)) != 1){

        if (rc == -1){
            printf(
             "Sender: error waiting for put data to be picked up (%s)\n",
                CPstrerror(CPerrno));
            exit(-1); 
        } 
    }

    printf("Sender: Put data has been picked up by remote process.\n");

    server_library_done();

    return 0;
}



6. Get messages

A get message in the Cplant server library is a message sent from process A (the initiator) to remote process B (the receiver) requesting that process B send a large message to process A.

This section describes functions that initiate a get request and reply to a get request. The function that tests for completion of a get request at the initiator's end, srvr_test_write_buf, is described in the section on data buffers.

6.1 srvr_comm_get_req  Send a request to get data from a remote process
6.2 srvr_comm_get_reply  Reply to a remote process' get request
6.3 srvr_comm_get_reply_partial  Reply by sending data in multiple blocks
6.4 Example  Example code



6.1 srvr_comm_get_req

#include "srvr_comm.h"

Function: int srvr_comm_get_req(char *buf, int len, int msg_type, char *user_data, int user_data_len, int nid, int pid, int portal_id, int blocking, double tmout)

buf
Location where data from remote process should be deposited.
len
Length in bytes of requested data.
msg_type
An optional tag that accompanies the get request to the remote node.
user_data
Pointer to optional buffer containing small amount of user defined information that will be sent along with the request.
user_data_len
Length in bytes of user_data buffer.
nid
Physical node ID of receiver.
pid
Portal process ID of receiver.
portal_id
Portal index for a control portal at which receiver is checking for your message.
blocking
BLOCKING if the call should block until the data is received; NONBLOCKING if the call should return as soon as the user_data buffer may be reused.
tmout
Number of seconds before a blocking call times out. Set to 0.0 to block indefinitely.
RETURN
If call is non-blocking, a handle is returned on success, -1 is returned on failure. If the call is blocking, 0 (zero) is returned on success, -1 is returned on error.

This function is used to send a request to a remote process to get data from that process. The call can be blocking or non-blocking. If blocking, the call returns after the data has arrived from the remote process and has been written to the buffer you supply. If non-blocking, the call returns after the request has gone out and the user_data buffer can be reused.

In the non-blocking case, a handle is returned representing the slot in the data portal where your buffer is waiting for the incoming data. Test for completion of the operation by supplying the handle to srvr_test_write_buf. When the operation has completed, free the slot for reuse with srvr_delete_buf.
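
The non-blocking pattern described above can be sketched as a bounded polling loop. This is only a sketch, not library code: poll_until_done and test_fn are hypothetical names, and the loop assumes the test function follows the convention used by the example in this chapter (1 when the operation is complete, -1 on error, anything else meaning "not yet"). In real use one would pass srvr_test_write_buf, assuming it takes the handle as its only argument as that example does. The fake_test function is a stand-in so the sketch can be exercised without the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature: 1 when done, -1 on error, 0 if not done yet. */
typedef int (*test_fn)(int handle);

/*
** Poll test() until it reports completion or an error, or until
** max_polls attempts have been made.
** Returns 1 on completion, -1 on error, 0 if we gave up.
*/
int poll_until_done(test_fn test, int handle, long max_polls)
{
    long n;
    int rc;

    assert(test != NULL);

    for (n = 0; n < max_polls; n++) {
        rc = test(handle);
        if (rc == 1 || rc == -1) {
            return rc;
        }
    }
    return 0;
}

/* Stand-in for the library's test function: completes on the 3rd poll. */
int fake_polls = 0;

int fake_test(int handle)
{
    (void)handle;
    return (++fake_polls >= 3) ? 1 : 0;
}
```

After a return of 1 the caller would release the slot with srvr_delete_buf, as described above. Bounding the number of polls is a choice of this sketch; the library itself does not require it.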



6.2 srvr_comm_get_reply

#include "srvr_comm.h"

Function: int srvr_comm_get_reply(control_msg_handle *handle,
void *reply_buf, int reply_len)

handle
The handle obtained by the call to srvr_get_next_control_msg by which the get request was received.
reply_buf
The location of the buffer containing the data to be sent in reply to the get request of the remote node.
reply_len
The length of the reply_buf.
RETURN
0 (zero) on success, -1 on failure.

Upon receiving a get request, a process can send the requested data back to the originator by calling srvr_comm_get_reply. When the call returns, the reply buffer may be reused.



6.3 srvr_comm_get_reply_partial

#include "srvr_comm.h"

Function: int srvr_comm_get_reply_partial(control_msg_handle *handle,
void *reply_buf, int reply_len, int offset)

handle
The handle obtained by the call to srvr_get_next_control_msg by which the get request was received.
reply_buf
The location of the buffer containing the data to be sent in reply to the get request of the remote node.
reply_len
The length of the reply_buf.
offset
The offset at which to write the data in the remote process' receive buffer.
RETURN
0 (zero) on success, -1 on failure.

It may be desirable to respond to a get request in packets of a certain block size. Use repeated calls to srvr_comm_get_reply_partial to do this. The initiator's call to srvr_test_write_buf will not indicate completion until the entire get request is satisfied.
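
The block-by-block reply described above amounts to walking the buffer with a running offset. The following is a minimal sketch under stated assumptions: reply_in_blocks and partial_fn are hypothetical names, and the callback stands in for srvr_comm_get_reply_partial (the control message handle would be carried in ctx by a small wrapper). The copy_block function simulates the remote receive buffer so the sketch can be run without the library.

```c
#include <assert.h>
#include <string.h>

/*
** Hypothetical callback shaped like srvr_comm_get_reply_partial,
** with the handle hidden behind ctx.
** Returns 0 on success, -1 on failure.
*/
typedef int (*partial_fn)(void *ctx, void *buf, int len, int offset);

/*
** Send len bytes of data in pieces of at most blksize bytes.
** Returns the number of blocks sent, or -1 if a block failed.
*/
int reply_in_blocks(partial_fn send, void *ctx,
                    char *data, int len, int blksize)
{
    int offset, n, nblocks = 0;

    assert(send != NULL && blksize > 0 && len >= 0);

    for (offset = 0; offset < len; offset += n) {
        n = len - offset;
        if (n > blksize) {
            n = blksize;          /* only the last block may be short */
        }
        if (send(ctx, data + offset, n, offset) == -1) {
            return -1;
        }
        nblocks++;
    }
    return nblocks;
}

/* Stand-in for the remote receive buffer: ctx points at it. */
int copy_block(void *ctx, void *buf, int len, int offset)
{
    memcpy((char *)ctx + offset, buf, (size_t)len);
    return 0;
}
```

The offset passed to each call is the position of the block within the whole reply, which matches the description of the offset argument above.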



6.4 Example

This example shows a process that sends a get request and a process that replies to it. The processes must be run on separate nodes since they request the same fixed portal ID. Both must also be run as root, since they request a fixed portal ID rather than letting the system assign one.

First the receiver:

 
#include <stdio.h>
#include <stdlib.h>
#include "ppid.h"
#include "srvr_comm.h"

#define EXAMPLE_GET  0x0505

extern nid_type _my_pnid;
extern ppid_type _my_ppid;

int main(void)
{
int portal_id;
int rc, i;
control_msg_handle handle;
int type, xferLen;
char *getBuf;

    /*
    **  Set my portal ID - init the server library
    */
    _my_ppid = register_ppid(_my_pcb, PPID_TEST, GID_TEST);

    if (_my_ppid != PPID_TEST){
        log_error("Can not register myself as PPID=%d\n", PPID_TEST);
    }

    if (server_library_init()){
         log_error("Can't init server library");
    }

    /*
    ** Need a control portal to receive the get request.
    */
    portal_id = srvr_init_control_ptl(1);

    if (portal_id == SRVR_INVAL_PTL){
        printf("Can't create control portal (%s)\n",
            CPstrerror(CPerrno));
        exit(-1);
    }

    printf("My physical node ID: %d\n",_my_pnid);
    printf("My portal process ID: %d\n",_my_ppid);
    printf("My control portal index: %d\n",portal_id);
    printf("OK to start sender process.\n");

    SRVR_CLEAR_HANDLE(handle);

    while (1){
        rc = srvr_get_next_control_msg(portal_id, &handle,
                 &type, &xferLen, NULL);

        if (rc == 1) break;

        if (rc == -1){
            printf("Error polling control portal (%s)\n",
                CPstrerror(CPerrno));
            exit(-1);
        }
    }

    if (type != EXAMPLE_GET){
       printf("Unexpected message received\n");
       exit(-1);
    }

    printf("Receiver got a Get request from %d/%d for %d bytes.\n",
                  SRVR_HANDLE_NID(handle),
                  SRVR_HANDLE_PID(handle),
                  xferLen);

    getBuf = (char *)malloc(xferLen);

    if (!getBuf){
        printf("Receiver: malloc problem\n");
        exit(-1);
    }

    for (i=0; i<xferLen; i++){
        getBuf[i] = i%3;
    } 

    rc = srvr_comm_get_reply(&handle, getBuf, xferLen);

    if (rc == -1){
         printf("Receiver: error in get reply (%s)\n",
                      CPstrerror(CPerrno));
    }

    free(getBuf);

    srvr_release_control_ptl(portal_id);

    server_library_done();

    return 0;
}

Now the initiator of the get request:

 
#include <stdio.h>
#include <stdlib.h>
#include "ppid.h"
#include "srvr_comm.h"

#define EXAMPLE_GET 0x0505

extern ppid_type _my_ppid;

char getBuf[1024];

int main(int argc, char *argv[])
{
int portal_id;
nid_type nid;
ppid_type pid;
int rc, i, handle;

    if (argc < 3){
       printf (
       "Sender needs receiver nid and portal id, in that order\n");
       exit(-1);
    }

    nid = (nid_type)atoi(argv[1]);
    pid = PPID_TEST;
    portal_id = atoi(argv[2]);

    /*
    **  Set my portal ID - init the server library
    */
    _my_ppid = register_ppid(_my_pcb, PPID_TEST, GID_TEST);

    if (_my_ppid != PPID_TEST){
        log_error("Can not register myself as PPID=%d\n", PPID_TEST);
    }

    if (server_library_init()){
         log_error("Can't init server library");
    }

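    /*
    ** Build with -DEXAMPLE_NONBLOCKING to try the non-blocking path.
    ** (A separate macro name avoids colliding with the library's
    ** NONBLOCKING constant, which is passed as an argument below.)
    */
#ifdef EXAMPLE_NONBLOCKING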
    handle = srvr_comm_get_req(getBuf, 1024, 
               EXAMPLE_GET, NULL, 0,
               nid, pid, portal_id,
               NONBLOCKING, 0);

    if (handle == -1){
        printf("Sender: error sending get request to %d/%d/%d (%s)\n",
                nid, pid, portal_id,
                CPstrerror(CPerrno));
        exit(-1);
    }

    printf("Sender: Get request has been sent to %d/%d/%d\n",
                       nid, pid, portal_id);

    while ((rc = srvr_test_write_buf(handle)) != 1){

        if (rc == -1){
            printf(
             "Sender: error waiting for get data to come in (%s)\n",
                CPstrerror(CPerrno));
            exit(-1);
        }
    }

    srvr_delete_buf(handle);
#else
    rc = srvr_comm_get_req(getBuf, 1024,
               EXAMPLE_GET, NULL, 0,
               nid, pid, portal_id,
               BLOCKING, 0);

    if (rc == -1){
        printf("Sender: error sending get request to %d/%d/%d (%s)\n",
                nid, pid, portal_id,
                CPstrerror(CPerrno));
        exit(-1);
    }

#endif

    printf("Sender: Get data has come in from remote process.\n");

    for (i=0; i<1024; i++){
        if (getBuf[i] != i%3){
            printf("Sender: Invalid data in get buffer.\n");
            break;
        }
    }

    server_library_done();

    return 0;
}



7. Group formation and communication

The PCT (Process Control Thread) on Cplant is the process on the compute node that sets up the environment for the application process, creates the process, and catches its termination. To make applications load quickly, the server library was enhanced to allow temporary groups to form, perform efficient group communication, and disband. The PCTs use these functions to form a group when loading an application, to fan out the executable image and user environment data, and to vote on matters relating to the application load.

The group function names in the server library are prefaced with dsrvr_, meaning distributed server. They provide the beginning functionality that will enable distributed services on Cplant. What remains is to add functions that will allow members to leave and join the group without disrupting service, and functions that will allow the members to share some global state.

7.1 Group fundamentals  All about groups
7.2 Group formation  Group formation and dissolution
7.3 Group operations  Group communication



7.1 Group fundamentals

This section describes the error codes returned by the group functions, and how to determine which member failed when a global operation fails. It also describes functions to access the table of group members, and to determine a group member's status.

7.1.1 Fault detection  Error codes, strings and further information
7.1.2 Query functions  Getting information about group members
7.1.3 Member status  Member status types



7.1.1 Fault detection

The group functions normally return a condition code that indicates success or one of three types of error:

DSRVR_OK
The function completed successfully.
DSRVR_ERROR
A general error condition, such as invalid function arguments.
DSRVR_RESOURCE_ERROR
An error of the local member, such as insufficient memory on the heap or inconsistent data structures.
DSRVR_EXTERNAL_ERROR
There's nothing wrong with the local member, but a remote member of the group failed to respond to a group operation within a reasonable amount of time, or it sent an invalid message.

If a group membership or communication function returns DSRVR_RESOURCE_ERROR, it is probably time to give up and terminate. If it returns DSRVR_EXTERNAL_ERROR, you may want to report the error, abandon the group, and carry on.

If a function returns DSRVR_EXTERNAL_ERROR, there will be information in the global data structure dsrvr_failInfo describing the failure. In particular, dsrvr_failInfo.last_nid and dsrvr_failInfo.last_pid give the physical node ID and the portal process ID of the member that failed.

These functions may be useful in your fault detection scheme:

#include "srvr_coll.h"

Function: void dsrvr_clear_fail_info (void)

This function clears the global dsrvr_failInfo structure.

Function: char *dsrvr_who_failed ()

This function returns a string describing the failure in detail. (The string is invalid after the next call to dsrvr_who_failed.)

See also the functions memberBizarreOnCollectiveOp and memberTimedOutOnCollectiveOp in the discussion of member status.
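
The advice above on reacting to each condition code can be written as a small dispatch function. This is a sketch only: the enum values are stand-ins so the code compiles without srvr_coll.h (the real values come from that header and may differ), and group_action, action_for and the action names are hypothetical. Treating DSRVR_ERROR as "check the arguments" is this sketch's reading of the description above, not something the library prescribes.

```c
#include <assert.h>

/*
** Stand-in values so the sketch compiles without srvr_coll.h.
** The real values come from that header and may differ.
*/
enum { DSRVR_OK, DSRVR_ERROR, DSRVR_RESOURCE_ERROR, DSRVR_EXTERNAL_ERROR };

/* Hypothetical names for what the caller should do next. */
typedef enum {
    KEEP_GOING,        /* the operation succeeded */
    CHECK_ARGS,        /* general error, e.g. invalid arguments */
    TERMINATE,         /* the local member is unhealthy */
    ABANDON_GROUP      /* a remote member failed: report, disband, carry on */
} group_action;

/* Map a group function's return code to the suggested reaction. */
group_action action_for(int rc)
{
    switch (rc) {
    case DSRVR_OK:             return KEEP_GOING;
    case DSRVR_RESOURCE_ERROR: return TERMINATE;
    case DSRVR_EXTERNAL_ERROR: return ABANDON_GROUP;
    case DSRVR_ERROR:
    default:                   return CHECK_ARGS;
    }
}
```

In the ABANDON_GROUP case a caller would typically log the string from dsrvr_who_failed before disbanding, since dsrvr_failInfo describes the member that failed.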



7.1.2 Query functions

#include "srvr_coll.h"

Function: int memberNidByRank (int rank)

Function: int memberPidByRank (int rank)

Function: int memberRankByNidPid (int nid, int pid)

These functions return the physical node ID, the portal process ID, and the rank of the specified member respectively. They return DSRVR_ERROR on error.



7.1.3 Member status

The model for membership is that all members join at the start, provide a service, and then all members disband. If a member fails at some point, the group must disband.

A future enhancement of the membership functions of the Cplant server library will include groups where members can leave and join. (This will enable distributed services, like a bebopd node allocator that is composed of cooperating bebopds on different service nodes.)

In anticipation of this, functions to set and get member status were added.

The status names (defined in an enumeration) are:

activeMember
A normal, active member of the group.
inActiveMember
A former member that has left the group.
timedOutOnCollectiveOp
A member of the group that failed to complete a global operation within a timeout value.
bizarreOnCollectiveOp
A member of the group that sent an invalid message during a global operation.
joining
A member in the process of joining the group, but not committed yet.
leaving
A member in the process of leaving the group, but not committed yet.

The related functions are:

#include "srvr_coll.h"

Function: void memberTimedOutOnCollectiveOp (int groupRank)

This function may be called when a group member fails to perform a global operation within a time limit. It writes a message to the log file, and sets the member's status to timedOutOnCollectiveOp.

Function: void memberBizarreOnCollectiveOp (int groupRank)

This function may be called when a group member sends invalid data. It writes a message to the log file, and sets the member's status to bizarreOnCollectiveOp.

Function: int memberStatusbyRank (int rank)

This function returns the status of the member identified by rank. The statuses are listed above.



7.2 Group formation

This section contains the functions that allow processes to form a group, and then disband the group.

The best example of the use of group functions is the PCT source code. See pct_group.c for an example of group formation at work.

7.2.1 dsrvr_member_init  Begin to form a new group
7.2.2 dsrvr_member_add  Add a member to the new group
7.2.3 dsrvr_membership_commit  Finalize group formation
7.2.4 dsrvr_member_done  Dissolve the group



7.2.1 dsrvr_member_init

#include "srvr_coll.h"

Function: int dsrvr_member_init (int maxm, int myRank, int groupId)

maxm
The number of members in the group.
myRank
My rank in the group; ranks range from 0 to maxm - 1.
groupId
An identifier for the group. The PCTs use the Cplant job ID for the job they are hosting.
RETURN
DSRVR_OK on success, DSRVR_RESOURCE_ERROR or DSRVR_ERROR on failure.

This function initializes the group member functions. A process can be a member of only one group at a time. Do not call dsrvr_member_init again until you have disbanded the present group.



7.2.2 dsrvr_member_add

#include "srvr_coll.h"

Function: int dsrvr_add_member (int nid, int pid, int rank)

nid
The physical node ID of a new member of the group.
pid
The portal process ID of a new member of the group.
rank
The rank in the group of the new member.
RETURN
DSRVR_OK on success or DSRVR_ERROR on failure.

This function adds a member to the group. When all members have been added, call dsrvr_membership_commit.



7.2.3 dsrvr_membership_commit

#include "srvr_coll.h"

Function: int dsrvr_membership_commit (double timeout)

timeout
A timeout in seconds to wait for the global commit operation to complete.
RETURN
DSRVR_OK on success, or DSRVR_RESOURCE_ERROR, DSRVR_EXTERNAL_ERROR, or DSRVR_ERROR on failure.

This function is a barrier that returns when all the members that have been added with dsrvr_member_add have called dsrvr_membership_commit. If it returns with DSRVR_EXTERNAL_ERROR then one or more of the members failed to commit. On successful completion of this call, global communication may commence.

The global structure dsrvr_failInfo contains information about which member failed in the commit.



7.2.4 dsrvr_member_done

#include "srvr_coll.h"

Function: void dsrvr_member_done ()

This function disbands the group that was formed with dsrvr_membership_commit. It is not a global operation; it returns after a few data structures are updated. A new group with different members may now be formed.
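
The four calls in this section are used in a fixed order: initialize, add each member, commit, and eventually disband. The sketch below shows that order under several assumptions, all of them this sketch's own: the group_api structure of function pointers is a hypothetical indirection so the code compiles without the library (in real use each pointer would be set to the dsrvr_ function with the matching prototype), struct member and form_group are hypothetical names, every member including the caller is added, and the group is torn down with the done call whenever an add or the commit fails. The enum values are stand-ins for those in srvr_coll.h. Going through pointers also sidesteps the fact that the prototype above names the add function dsrvr_add_member while the section heading calls it dsrvr_member_add.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in values; the real ones come from srvr_coll.h. */
enum { DSRVR_OK, DSRVR_ERROR, DSRVR_RESOURCE_ERROR, DSRVR_EXTERNAL_ERROR };

/* One row of the caller's member table (hypothetical type). */
struct member { int nid; int pid; };

/* Hypothetical indirection so the sketch runs without the library. */
struct group_api {
    int  (*init)(int maxm, int myRank, int groupId);
    int  (*add)(int nid, int pid, int rank);
    int  (*commit)(double timeout);
    void (*done)(void);
};

/*
** Form a group from tbl[0..n-1], where the array index is the rank.
** Returns DSRVR_OK once the commit barrier succeeds. If an add or the
** commit fails, the partial group is torn down and the code returned.
*/
int form_group(const struct group_api *api, const struct member *tbl,
               int n, int myRank, int groupId, double timeout)
{
    int rc, rank;

    assert(api != NULL && tbl != NULL);

    rc = api->init(n, myRank, groupId);
    if (rc != DSRVR_OK) {
        return rc;
    }
    for (rank = 0; rank < n; rank++) {
        rc = api->add(tbl[rank].nid, tbl[rank].pid, rank);
        if (rc != DSRVR_OK) {
            api->done();
            return rc;
        }
    }
    rc = api->commit(timeout);
    if (rc != DSRVR_OK) {
        api->done();
    }
    return rc;
}

/* Stand-ins used only to exercise the sketch. */
int n_added = 0, n_done = 0, commit_rc = DSRVR_OK;

int fake_init(int maxm, int myRank, int groupId)
{
    (void)maxm; (void)myRank; (void)groupId;
    n_added = 0;
    return DSRVR_OK;
}

int fake_add(int nid, int pid, int rank)
{
    (void)nid; (void)pid; (void)rank;
    n_added++;
    return DSRVR_OK;
}

int fake_commit(double timeout) { (void)timeout; return commit_rc; }

void fake_done(void) { n_done++; }
```

For real group formation code, pct_group.c in the PCT source remains the reference.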



7.3 Group operations

Once a group is formed with dsrvr_membership_commit, the members can engage in scalable group communication. This section describes the functions that perform group operations.

7.3.1 dsrvr_barrier  A barrier
7.3.2 dsrvr_gather  Fan in contributions to rank 0 process
7.3.3 dsrvr_bcast  Fan out data from rank 0 process
7.3.4 dsrvr_vote  Group vote operation



7.3.1 dsrvr_barrier

#include "srvr_coll.h"

Function: int dsrvr_barrier (double tmout, int *list, int listLen)

tmout
The time in seconds that may elapse before the uncompleted barrier is considered unsuccessful. A timeout of 0.0 means wait indefinitely.
list
The list of member ranks to include in the barrier. NULL means all members of the group.
listLen
The number of ranks listed in list.
RETURN
DSRVR_OK on success, or DSRVR_RESOURCE_ERROR, DSRVR_EXTERNAL_ERROR, or DSRVR_ERROR on failure.

This function performs a barrier between all the group members listed in list, or all the members of the group if list is NULL. If the barrier has not completed before tmout seconds have elapsed, the function returns with DSRVR_EXTERNAL_ERROR, and the structure described in the Fault detection section may be accessed for more information.



7.3.2 dsrvr_gather

#include "srvr_coll.h"

Function: int dsrvr_gather (char *data, int blklen,
int nblks, double tmout, int type, int *list, int listLen)

data
A buffer large enough to contain each member's field. The caller's contribution is in its own block in the data buffer.
blklen
The length in bytes of each member's field.
nblks
The number of members in the gather operation.
tmout
A timeout in seconds after which an uncompleted gather operation will be considered to have failed. (0.0 means wait indefinitely.)
type
A message type to identify the gather operation.
list
A list of the ranks of the members participating in the gather operation. NULL pointer implies all members participate.
listLen
The number of members listed in the list.
RETURN
DSRVR_OK on success, or DSRVR_RESOURCE_ERROR, DSRVR_EXTERNAL_ERROR, or DSRVR_ERROR on failure.

This function performs a gather operation for the members listed in list, or all members of the group if list is NULL. When calling the function, the caller's block should be set in the data buffer. On completion of the call, all members' contributions are in the data buffer on the root node. (The root node is the rank 0 node if list is NULL, or it is the first rank in the list.)

You won't like this, but in the Portals 2 implementation, each member's contribution must have at least one non-zero bit. Otherwise the function cannot tell when all members' contributions have arrived. (If you are gathering integers, and zero may be a valid contribution, gather pairs of integers instead, where one integer is always set to a non-zero value.)

If the gather operation has not completed before tmout seconds have elapsed, the function returns with DSRVR_EXTERNAL_ERROR, and the structure described in the Fault detection section may be accessed for more information.
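
The pairs-of-integers workaround described in this section can be sketched as follows. The struct int_block layout, the BLOCK_FILLED marker value and the two helper names are hypothetical, and zeroing the rest of the buffer before the call is an assumption of this sketch rather than a documented requirement. The buffer would be passed to dsrvr_gather cast to char *, with blklen set to sizeof(struct int_block).

```c
#include <assert.h>
#include <string.h>

/* Hypothetical block layout: marker is non-zero once a block is filled. */
struct int_block {
    int marker;
    int value;      /* the real contribution; zero is allowed here */
};

#define BLOCK_FILLED 1

/* Zero the whole gather buffer, then write the caller's own block. */
void prepare_gather(struct int_block *blocks, int nblks,
                    int myIndex, int value)
{
    assert(blocks != NULL && myIndex >= 0 && myIndex < nblks);

    memset(blocks, 0, (size_t)nblks * sizeof *blocks);
    blocks[myIndex].marker = BLOCK_FILLED;
    blocks[myIndex].value  = value;
}

/* Count the blocks whose marker shows that a contribution is present. */
int blocks_filled(const struct int_block *blocks, int nblks)
{
    int i, n = 0;

    for (i = 0; i < nblks; i++) {
        if (blocks[i].marker != 0) {
            n++;
        }
    }
    return n;
}
```

Because the marker is always non-zero, a member can contribute the value 0 and its block still carries a non-zero bit.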



7.3.3 dsrvr_bcast

#include "srvr_coll.h"

Function: int dsrvr_bcast (char *buf, int len, double tmout,
int type, int *list, int listLen)

buf
The location of the buffer to be broadcast.
len
The length in bytes of buf.
tmout
A timeout in seconds after which the uncompleted broadcast will be considered to have failed. (0.0 means wait indefinitely.)
type
A message type associated with the broadcast.
list
A list of the ranks of members participating in the broadcast. If this is NULL, then all members of the group participate.
listLen
The number of members listed in list.
RETURN
DSRVR_OK on success, or DSRVR_RESOURCE_ERROR, DSRVR_EXTERNAL_ERROR, or DSRVR_ERROR on failure.

This function broadcasts the contents of buf from the root member out to the rest of the members. (The root node is the rank 0 node if list is NULL, or it is the first rank in the list.)

If the broadcast has not completed before tmout seconds have elapsed, the function returns with DSRVR_EXTERNAL_ERROR, and the structure described in the Fault detection section may be accessed for more information.
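
The rule for choosing the root, stated in this section and in the gather section, is short enough to write down directly. The helper names collective_root and i_am_root are hypothetical; the library does not export them. A caller might use such a helper to decide whether it should fill buf before calling dsrvr_bcast.

```c
#include <assert.h>
#include <stddef.h>

/* Rank of the root for a collective call, following the rule above. */
int collective_root(const int *list, int listLen)
{
    if (list == NULL) {
        return 0;                 /* whole group: rank 0 is the root */
    }
    assert(listLen > 0);
    return list[0];               /* subset: first listed rank is the root */
}

/* Non-zero if myRank is the root for this call. */
int i_am_root(int myRank, const int *list, int listLen)
{
    return myRank == collective_root(list, listLen);
}
```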



7.3.4 dsrvr_vote

#include "srvr_coll.h"

Function: int dsrvr_vote (int voteVal, double tmout, int type, int *list, int listLen)

voteVal
The caller's vote.
tmout
A timeout in seconds after which the uncompleted vote will be considered to have failed. (0.0 means wait indefinitely.)
type
A message type associated with the vote.
list
The list of the ranks of the members participating in the vote. If NULL, then all group members are voting.
listLen
The number of ranks in the list.
RETURN
DSRVR_OK on success, or DSRVR_RESOURCE_ERROR, DSRVR_EXTERNAL_ERROR, or DSRVR_ERROR on failure.

This function tallies a vote from all group members listed in list, or from all members in the group if list is NULL.

Upon completion of the call, all members can access the votes through the macro DSRVR_VOTE_VALUE(rank). The votes disappear at the next call to dsrvr_vote.

If the vote has not completed before tmout seconds have elapsed, the function returns with DSRVR_EXTERNAL_ERROR, and the structure described in the Fault detection section may be accessed for more information.
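
Since the votes disappear at the next call to dsrvr_vote, a caller that needs them later may want to copy them out and tally them. The sketch below assumes the caller has already copied DSRVR_VOTE_VALUE(rank) for each participating rank into a plain array; count_votes and unanimous are hypothetical helper names, not part of the library.

```c
#include <assert.h>
#include <stddef.h>

/*
** Count how many of the nvotes entries equal want.
** votes[i] is assumed to have been copied from DSRVR_VOTE_VALUE(rank)
** for each participating rank before the next call to dsrvr_vote.
*/
int count_votes(const int *votes, int nvotes, int want)
{
    int i, n = 0;

    assert(votes != NULL || nvotes == 0);

    for (i = 0; i < nvotes; i++) {
        if (votes[i] == want) {
            n++;
        }
    }
    return n;
}

/* Non-zero if every participant voted want. */
int unanimous(const int *votes, int nvotes, int want)
{
    return count_votes(votes, nvotes, want) == nvotes;
}
```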



8. Errors and logging

This is a catchall section bringing together all aspects of the library that deal with error detection, error handling and error reporting.

See the section on Fault detection for group operations to learn more about error reporting in the group functions.

8.1 CPerrno global error value  The Cplant global error value
8.2 Error handling  logging errors



8.1 CPerrno global error value

Most server library functions return 0 (zero) on success and -1 to indicate an error. When an error condition is returned, the routine has usually set the global CPerrno to a value that indicates the cause of the error. These are the values set by the library:

EINVAL
invalid argument
EPORTAL
portal misbehavior
ELOCK
failed memory lock
EUNLOCK
failed memory unlock
ENOMEM
out of memory
ESENDTIMEOUT
timeout on send flag
ERECVTIMEOUT
timeout on message wait
EMLSTFULL
full match list
EPTLTRAP
portal action trap failure
EPTLTIMEOUT
timeout waiting on portal state
EOHHELL
library internal error
ERESOURCE
out of resource in library
ECORRUPT
incoming message is corrupt

In the source code repository, the errors are defined in `/top/include/portals/puma_errno.h'. The strings describing the errors are defined in `/top/lib/puma/clib/pumaerr.c'.

The name CPerrno indicates that this value is set to the error encountered at the Cplant library level. It is analogous to the errno value set by the underlying operating system's libraries.

Function: char *CPstrerror (int errnum)

This function returns a pointer to a string describing the error.



8.2 Error handling

The Cplant server library provides a logging facility through syslog. The library logs exceptions to `/var/log/cplant' (assuming the system administrator set the local7 facility to go to `/var/log/cplant'). Any code linked with the library may log there as well. If CPerrno is set, the logging routines will log the error string associated with it.

Function: void log_open (const char *myidentity)

myidentity
A string that will be prepended to log entries.

This function sets up the string that will appear before all your log entries. If you don't call this function, the library's logging will appear with some generic string.

Function: void log_reopen (const char *myidentity)

myidentity
A string that will be prepended to log entries.

This function closes and reopens the log file.

Function: void log_to_file (int on)

on
1 to log to file, 0 to stop logging to file

By default, all the log_* functions described below log to the file `/var/log/cplant'. Call this function with 0 (zero) to turn off logging to the file.

Function: void log_to_stderr (int on)

on
1 to log to stderr, 0 to stop logging to stderr

By default, the log_* functions described below do not log to stderr. Call this function with 1 (one) to turn on logging to stderr.

Function: void log_warning (const char *fmt, ...)

fmt
printf style string with optional conversion specifications
...
arguments for the conversion specifications in fmt

This function logs the provided string, plus the errno and CPerrno values if set, with priority LOG_WARNING. Then it returns to the caller.

Function: void log_error (const char *fmt, ...)

fmt
printf style string with optional conversion specifications
...
arguments for the conversion specifications in fmt

This function logs the provided string, plus the errno and CPerrno values if set, with priority LOG_ERR. Then it exits.

Function: void log_msg (const char *fmt, ...)

fmt
printf style string with optional conversion specifications
...
arguments for the conversion specifications in fmt

This function logs the provided string, plus the CPerrno value if set, with priority LOG_WARNING. Then it returns to the caller.

Function: void log_quit (const char *fmt, ...)

fmt
printf style string with optional conversion specifications
...
arguments for the conversion specifications in fmt

This function logs the provided string, plus the CPerrno value if set, with priority LOG_ERR. Then it exits.
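
To make the behavior of the logging calls concrete, here is a rough approximation of a log_warning-style routine built on the standard syslog interface. It is not the library's implementation: the names build_line, vbuild_line, sketch_log_warning and SKETCH_LINE_LEN are hypothetical, the output format is invented for illustration, and only errno is appended (the CPerrno part is left out because CPstrerror is not available outside the library). The local7 facility matches the description at the top of this section.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>
#include <syslog.h>

#define SKETCH_LINE_LEN 512

/*
** Format the caller's message into out and, if err is non-zero,
** append ": " and the strerror text. Returns 0 on success, -1 if
** the message could not be formatted.
*/
int vbuild_line(char *out, size_t outlen, int err,
                const char *fmt, va_list ap)
{
    int n;

    assert(out != NULL && fmt != NULL);

    n = vsnprintf(out, outlen, fmt, ap);
    if (n < 0) {
        return -1;
    }
    if (err != 0 && (size_t)n < outlen) {
        snprintf(out + n, outlen - (size_t)n, ": %s", strerror(err));
    }
    return 0;
}

/* Variadic front end for vbuild_line, convenient for testing. */
int build_line(char *out, size_t outlen, int err, const char *fmt, ...)
{
    va_list ap;
    int rc;

    va_start(ap, fmt);
    rc = vbuild_line(out, outlen, err, fmt, ap);
    va_end(ap);
    return rc;
}

/* Hypothetical look-alike of log_warning: log, then return. */
void sketch_log_warning(const char *fmt, ...)
{
    char line[SKETCH_LINE_LEN];
    int saved = errno;      /* capture errno before it can change */
    va_list ap;

    va_start(ap, fmt);
    if (vbuild_line(line, sizeof line, saved, fmt, ap) == 0) {
        syslog(LOG_LOCAL7 | LOG_WARNING, "%s", line);
    }
    va_end(ap);
}
```

A log_error-style routine would differ only in using LOG_ERR and calling exit after logging.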



About this document

This document was generated by Lee Ann Fisk on May 9, 2001 using texi2html.
