MPICH 1.2.0 over Portals 3.1

This document describes the port of MPICH 1.2.0 to the Portals 3.1 message passing layer for use on Cplant (1.0 and higher).

Protocols

To optimize the latency of small messages, the implementation uses a two-level protocol. The following sections describe the steps involved in the short message and long message protocols.

Definitions

unexpected message
a message for which no matching receive buffer has been posted by the application

Short Protocol

The short message protocol uses an eager send strategy and buffers unexpected messages at the receiving process. The sender sends the user data to the receiver, and the send completes as soon as the entire buffer has been delivered to the network. At the receive side, the queue of posted receives will be traversed in order to find a matching receive. If a match is found, the incoming message is deposited directly into the application's buffer. If the message is unexpected, it is deposited into one of several short "catchall" buffers that the MPI library has previously allocated. The MPI library allocates a fixed number of these buffers at initialization time. When a matching receive for this message is eventually posted, the message is copied out of the MPI buffer into the user-supplied buffer. The MPI buffer is then added back into the list of unexpected buffers.
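
The receive-side matching described above can be sketched as a small model. This is only an illustration, not the library's code: the class name ShortReceiver, the one-message-per-buffer accounting, and the error raised when the buffers run out are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ShortReceiver:
    """Toy model of the receive side of the short protocol (hypothetical)."""
    free_buffers: int = 4                           # assumed pool size
    posted: list = field(default_factory=list)      # (source, tag, user_buf)
    unexpected: list = field(default_factory=list)  # (source, tag, data)

    def deliver(self, source, tag, data):
        # Traverse the posted receives in order, looking for a match.
        for i, (src, t, user_buf) in enumerate(self.posted):
            if (src, t) == (source, tag):
                del self.posted[i]
                user_buf.extend(data)   # deposit directly into the user buffer
                return "matched"
        # No match: buffer the message in a library-owned buffer.
        if self.free_buffers == 0:
            # Behaviour on exhaustion is an assumption of this sketch.
            raise RuntimeError("out of unexpected-message buffers")
        self.free_buffers -= 1
        self.unexpected.append((source, tag, bytes(data)))
        return "unexpected"

    def post_receive(self, source, tag, user_buf):
        # A newly posted receive first checks for an already-arrived message.
        for i, (src, t, data) in enumerate(self.unexpected):
            if (src, t) == (source, tag):
                del self.unexpected[i]
                user_buf.extend(data)   # copy out of the library buffer
                self.free_buffers += 1  # return the buffer to the pool
                return True
        self.posted.append((source, tag, user_buf))
        return False
```

In this model, a message that arrives before its receive costs one extra copy and holds a buffer until the receive is posted, which is why the number of outstanding unexpected messages is bounded.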

Long Protocol

The long message protocol also uses an eager send strategy where the user data is sent immediately. However, rather than buffering unexpected messages at the receiver, unexpected long messages are buffered "in place" at the sender. Before initiating the send, the sending process makes the send buffer available for reading by the receiver. The sender sends the data to the receiver and waits for a network acknowledgment from the receiver. If a matching buffer is pre-posted, the entire message is deposited into the user buffer, and a network acknowledgment is sent informing the sender that the entire message was received. If the incoming long message is unexpected, it will match a long "catchall" zero-length buffer that records information about the incoming message and discards the incoming data. A network acknowledgment is sent informing the sender that the message was not received. At this point, the sender will wait for the receiver to read the message, after which the send operation is complete. When a matching receive is eventually posted, the receiver will retrieve the recorded information left by the incoming send, and read the message from the sender directly into the user's buffer.
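
The two outcomes of a long send (delivered on arrival, or read back later from the sender) can be sketched as follows. The classes LongSender and LongReceiver are hypothetical, the return value of deliver() stands in for the network acknowledgment, and keying messages by tag alone is a simplification for the example.

```python
class LongSender:
    """Toy sender: exposes its buffer for remote reads, then sends eagerly."""

    def __init__(self, data):
        self.data = bytes(data)
        self.exposed = False    # True while the receiver may read the buffer
        self.complete = False

    def send(self, receiver, tag):
        self.exposed = True                      # expose the buffer first
        received = receiver.deliver(self, tag)   # eager send; result models the ack
        if received:
            self._finish()                       # ack: whole message arrived
        # Otherwise the send stays pending until the receiver reads the data.

    def read_by_receiver(self):
        if not self.exposed:
            raise RuntimeError("send buffer is not exposed")
        data = self.data
        self._finish()
        return data

    def _finish(self):
        self.exposed = False
        self.complete = True


class LongReceiver:
    """Toy receiver for long messages."""

    def __init__(self):
        self.posted = {}        # tag -> user buffer (bytearray)
        self.unexpected = {}    # tag -> sender whose payload was discarded

    def deliver(self, sender, tag):
        if tag in self.posted:
            self.posted.pop(tag).extend(sender.data)
            return True
        # Zero-length catchall: record who sent it, keep none of the payload.
        self.unexpected[tag] = sender
        return False

    def post_receive(self, tag, user_buf):
        if tag in self.unexpected:
            # Read the message from the sender straight into the user buffer.
            user_buf.extend(self.unexpected.pop(tag).read_by_receiver())
        else:
            self.posted[tag] = user_buf
```

Note that in the unexpected case the payload crosses the network twice in this model, but the receiver never has to hold a large message in library memory.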

Synchronous Protocol

The short protocol described above is used for standard mode sends. For synchronous mode sends (MPI_Ssend()), the send operation does not complete until a matching receive operation has started. For short messages, the data is again sent eagerly, but the send does not complete until an acknowledgment is received. If a matching buffer is pre-posted at the receiver, the incoming message is deposited directly into the user buffer, and a network acknowledgment is sent informing the sender that the entire buffer was received. At this point, both the send and the receive have completed. If the incoming message is unexpected, the message is again deposited into one of the MPI-supplied unexpected message buffers. A network acknowledgment is not sent for messages deposited into these unexpected message buffers. When a matching receive is eventually posted, the receiver sends an acknowledgment message back to the sender and copies the message into the user-supplied buffer. Since the long message protocol meets the semantics of a synchronous send, the long synchronous protocol is identical to the long protocol.
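
The only difference from the standard short protocol is when the sender learns that the send is complete. A minimal sketch, assuming a hypothetical SyncReceiver class and a plain dictionary standing in for the send request:

```python
class SyncReceiver:
    """Toy receiver for short synchronous sends (hypothetical model)."""

    def __init__(self):
        self.posted = {}       # tag -> user buffer (bytearray)
        self.unexpected = {}   # tag -> (payload, send request)

    def deliver(self, tag, payload, send):
        if tag in self.posted:
            self.posted.pop(tag).extend(payload)
            send["complete"] = True    # network ack is sent on arrival
        else:
            # Buffered as unexpected: no acknowledgment yet.
            self.unexpected[tag] = (bytes(payload), send)

    def post_receive(self, tag, user_buf):
        if tag in self.unexpected:
            payload, send = self.unexpected.pop(tag)
            send["complete"] = True    # ack message sent once the receive is posted
            user_buf.extend(payload)
        else:
            self.posted[tag] = user_buf


def short_ssend(receiver, tag, payload):
    """Send eagerly and return a request that completes when acknowledged."""
    send = {"complete": False}
    receiver.deliver(tag, payload, send)
    return send
```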

Ready Protocol

In the ready send mode (MPI_Rsend()), the application programmer guarantees to the MPI library that a receive has been pre-posted at the receiver. For short messages, the protocol used is the same as a standard mode send. For long messages, the data is sent without first setting up the buffer to be read by the receiver. That is, the short ready send and the long ready send both send the data eagerly, and the send operation is complete as soon as the data has been delivered to the network. At the receive side, the entire message is deposited directly into the user buffer. If the message is unexpected, which is a violation of the standard, the entire message is dropped, and the library will eventually abort with an internal unrecoverable error.
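
The receive-side handling of a ready send can be sketched in a few lines. The function and exception names are hypothetical, and the sketch raises immediately, whereas the text above says only that the library aborts eventually.

```python
class ReadySendViolation(RuntimeError):
    """Hypothetical error standing in for the library's internal abort."""


def deliver_ready(posted, tag, payload):
    """Deposit a ready-mode message; `posted` maps tag -> user buffer."""
    if tag not in posted:
        # No pre-posted receive: the message is dropped. Raising here is a
        # simplification; the real library fails at some later point.
        raise ReadySendViolation(f"ready send with tag {tag}: no posted receive")
    posted.pop(tag).extend(payload)   # deposit directly into the user buffer
```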

Buffered Protocol

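In addition to standard, synchronous, and ready send modes, the standard also defines a buffered mode (MPI_Bsend()). Buffered mode sends typically copy the message into buffer space attached by the application before it is sent, so they are by nature low performance and should be avoided.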

Collective Operations

The collective operations are currently implemented using the default MPICH collective operations, which are built on top of the underlying MPI point-to-point communications. An optimized collective operation library built directly on Portals 3.1 is under development.

User-configurable settings

Several of the resources that the MPI library allocates at initialization time can be configured by the user through environment variables that are set in the shell in which yod is invoked. The following describes these values and their effect.
MPI_UNEX_BLOCK_SIZE
The value of this environment variable controls how much space is allocated to hold unexpected messages. The current strategy uses three fixed-size blocks of memory to buffer unexpected messages, and this value sets the size of each block. The default value is 2 MB (2097152 bytes), resulting in a total of 6 MB of space.
MPI_UNEX_MAX
The value of this environment variable controls the maximum number of unexpected messages (both long and short) that can be outstanding at any given time. The default value is 2048.
MPI_IRECV_MAX
The value of this environment variable controls the maximum number of outstanding posted receives that can be in progress at one time. Each MPI_Irecv() operation consumes a resource and only a finite number of these resources are available from Portals 3.1. The default value is 1024.
MPI_LONG_MAX
The value of this environment variable controls the maximum number of outstanding long send operations that can be in progress at one time. As with posted receives, each outstanding long send operation consumes a resource. The default value is 1024.
MPI_LONG_MSG
The value of this environment variable controls the definition of short and long messages. A short message is less than this value, while a long message is greater than or equal to this value. Setting this value allows the user to increase the short/long protocol switchover point. The default value is 8064.
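
How these settings combine can be sketched as follows. The variable names and default values are taken from the descriptions above; the helper names, and the assumption that each value is a plain decimal integer, are illustrative only and do not describe how the library itself parses its environment.

```python
import os

# Defaults as documented above.
DEFAULTS = {
    "MPI_UNEX_BLOCK_SIZE": 2097152,
    "MPI_UNEX_MAX": 2048,
    "MPI_IRECV_MAX": 1024,
    "MPI_LONG_MAX": 1024,
    "MPI_LONG_MSG": 8064,
}
UNEX_BLOCKS = 3  # the current strategy uses three blocks


def load_settings(env=None):
    """Return the effective settings, falling back to the defaults."""
    env = os.environ if env is None else env
    # Assumption: values are plain decimal integers.
    return {name: int(env.get(name, default)) for name, default in DEFAULTS.items()}


def unexpected_space(settings):
    """Total bytes reserved for unexpected messages."""
    return UNEX_BLOCKS * settings["MPI_UNEX_BLOCK_SIZE"]


def protocol_for(nbytes, settings):
    """Short if strictly below MPI_LONG_MSG, long otherwise."""
    return "short" if nbytes < settings["MPI_LONG_MSG"] else "long"
```

With the defaults, unexpected_space() gives 3 x 2097152 = 6291456 bytes (6 MB), and a message of exactly 8064 is already handled by the long protocol.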

Debugging

There are also environment variables that can be set to extract debugging information from MPI programs, provided the MPI library has been compiled with these debugging options enabled.
MPI_DEBUG
Setting this environment variable will cause all processes in the job to print low-level MPI implementation information to standard output. For large jobs or jobs that do a large amount of message passing, this output can be overwhelming.
MPI_DUMP_QUEUES
Setting this environment variable will cause all processes in the job to install a signal handler for the SIGUSR2 signal. This signal handler will dump the current state of the unexpected message queue from each process to standard output. While a job is running, a SIGUSR2 signal sent to the yod process will be propagated to each process in the job. Upon receiving this signal, each process will print its unexpected queue.
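
The MPI_DUMP_QUEUES mechanism follows a common pattern: install a handler only when the variable is set, and have the handler print the queue. The sketch below shows that pattern in isolation. The queue contents, the output format, and the function names are all invented for the example and do not reflect the library's actual output.

```python
import os
import signal

# Placeholder entries (source rank, tag, length) for illustration only.
unexpected_queue = [(0, 7, 128), (3, 42, 8064)]


def format_queue(queue):
    """Render one line per unexpected message (format is hypothetical)."""
    return [f"unexpected: source={src} tag={tag} length={n}" for src, tag, n in queue]


def dump_unexpected(signum, frame):
    # Signal handler: print the current unexpected queue to standard output.
    for line in format_queue(unexpected_queue):
        print(line, flush=True)


def install_dump_handler(env=None):
    """Install the SIGUSR2 handler only if MPI_DUMP_QUEUES is set."""
    env = os.environ if env is None else env
    if "MPI_DUMP_QUEUES" not in env:
        return False
    signal.signal(signal.SIGUSR2, dump_unexpected)  # Unix only, main thread only
    return True
```

On a real job the signal would be delivered to yod (for example with kill -USR2 and the yod process ID), which forwards it to every process.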