MPICH 1.2.0 over Portals 3.1
This document describes the port of MPICH 1.2.0 to the Portals 3.1
message passing layer for use on Cplant (1.0 and higher).
Protocols
In order to optimize the latency performance of small messages, the
implementation uses a two-level protocol. The following describes the
different steps involved in the short message and long message
protocols.
Definitions
- unexpected message
-
a message for which no matching receive buffer has been posted by the
application
Short Protocol
The short message protocol uses an eager send strategy and buffers
unexpected messages at the receiving process. The sender sends the
user data to the receiver, and the send completes as soon as the
entire buffer has been delivered to the network. At the receive side,
the queue of posted receives will be traversed in order to find a
matching receive. If a match is found, the incoming message is
deposited directly into the application's buffer. If the message is
unexpected, it is deposited into one of several short "catchall"
buffers that the MPI library has previously allocated. The MPI
library allocates a fixed number of these buffers at initialization
time. When a matching receive for this message is eventually posted,
the message is copied out of the MPI buffer into the user-supplied
buffer. The MPI buffer is then added back into the list of unexpected
buffers.
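The receive-side steps above can be sketched as a small, self-contained
model. This is not the library's code: the structure names, the
function names (deliver_short, post_recv), the fixed array sizes, and
matching on a tag alone are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_POSTED 8    /* illustrative sizes, not the library's values */
#define MAX_UNEX   4
#define SHORT_MAX  64

typedef struct { int in_use; int tag; char *user_buf; } posted_recv;
typedef struct { int in_use; int tag; size_t len; char data[SHORT_MAX]; } unex_buf;
typedef struct {
    posted_recv posted[MAX_POSTED];   /* receives posted by the application */
    unex_buf    unex[MAX_UNEX];       /* fixed pool of "catchall" buffers */
} receiver;

/* An eager short message arrives. Returns 1 if it was deposited directly
 * into a posted user buffer, 0 if it was buffered as unexpected, and -1 if
 * no unexpected buffer was free. */
int deliver_short(receiver *r, int tag, const char *data, size_t len)
{
    int i;
    assert(len <= SHORT_MAX);
    for (i = 0; i < MAX_POSTED; i++) {          /* traverse posted receives */
        if (r->posted[i].in_use && r->posted[i].tag == tag) {
            memcpy(r->posted[i].user_buf, data, len);
            r->posted[i].in_use = 0;
            return 1;
        }
    }
    for (i = 0; i < MAX_UNEX; i++) {            /* unexpected: buffer it */
        if (!r->unex[i].in_use) {
            r->unex[i].in_use = 1;
            r->unex[i].tag = tag;
            r->unex[i].len = len;
            memcpy(r->unex[i].data, data, len);
            return 0;
        }
    }
    return -1;
}

/* The application posts a receive. Returns 1 if a buffered unexpected
 * message matched and was copied out (its buffer is released), 0 if the
 * receive was queued, and -1 if the posted-receive table is full. */
int post_recv(receiver *r, int tag, char *user_buf)
{
    int i;
    for (i = 0; i < MAX_UNEX; i++) {
        if (r->unex[i].in_use && r->unex[i].tag == tag) {
            memcpy(user_buf, r->unex[i].data, r->unex[i].len);
            r->unex[i].in_use = 0;              /* back to the free pool */
            return 1;
        }
    }
    for (i = 0; i < MAX_POSTED; i++) {
        if (!r->posted[i].in_use) {
            r->posted[i].in_use = 1;
            r->posted[i].tag = tag;
            r->posted[i].user_buf = user_buf;
            return 0;
        }
    }
    return -1;
}
```

The model assumes that the user buffer is large enough for the message
and does not preserve arrival order among unexpected messages, which a
real MPI implementation must do.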
Long Protocol
The long message protocol also uses an eager send strategy where the
user data is sent immediately. However, rather than buffering
unexpected messages at the receiver, unexpected long messages are
buffered "in place" at the sender. Before initiating the send, the
sending process makes the send buffer available for reading by the
receiver. The sender sends the data to the receiver and waits for a
network acknowledgment from the receiver. If a matching buffer is
pre-posted, the entire message is deposited into the user buffer, and
a network acknowledgment is sent informing the sender that the entire
message was received. If the incoming long message is unexpected, it
matches a zero-length long "catchall" buffer, which records
information about the incoming message and discards the incoming data. A
network acknowledgment is sent informing the sender that the message
was not received. At this point, the sender will wait for the
receiver to read the message, after which the send operation is
complete. When a matching receive is eventually posted, the receiver
will retrieve the recorded information left by the incoming send, and
read the message from the sender directly into the user's buffer.
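The long protocol can be modeled in the same spirit. The sketch below
is an illustration only: the types, state names, and functions are
hypothetical, both "processes" live in one address space, and the
network acknowledgment is represented by directly updating the
sender's state.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { SEND_PENDING, SEND_WAIT_READ, SEND_COMPLETE } send_state;

typedef struct {
    const char *buf;    /* send buffer, exposed for reading by the receiver */
    size_t len;
    int tag;
    send_state state;
} long_send;

typedef struct {
    int posted;         /* 1 while a receive is posted */
    int tag;
    char *user_buf;
    long_send *unex;    /* information recorded by the zero-length catchall */
} long_receiver;

/* Sender: the buffer is already exposed; send eagerly, then act on the
 * acknowledgment. */
void long_send_start(long_send *s, long_receiver *r)
{
    assert(s->state == SEND_PENDING);
    if (r->posted && r->tag == s->tag) {
        memcpy(r->user_buf, s->buf, s->len);  /* whole message deposited */
        r->posted = 0;
        s->state = SEND_COMPLETE;             /* ack: message received */
    } else {
        r->unex = s;                          /* record info, discard data */
        s->state = SEND_WAIT_READ;            /* ack: message not received */
    }
}

/* Receiver posts a receive. Returns 1 if it matched a recorded unexpected
 * long message and read it from the sender's buffer, 0 if the receive was
 * left posted. */
int long_recv_post(long_receiver *r, int tag, char *user_buf)
{
    if (r->unex != NULL && r->unex->tag == tag) {
        memcpy(user_buf, r->unex->buf, r->unex->len);  /* read from sender */
        r->unex->state = SEND_COMPLETE;       /* the send can now complete */
        r->unex = NULL;
        return 1;
    }
    r->posted = 1;
    r->tag = tag;
    r->user_buf = user_buf;
    return 0;
}
```

The model tracks a single posted receive and a single unexpected long
message, which is enough to show both paths: the pre-posted path
completes the send on the first acknowledgment, while the unexpected
path leaves the send in SEND_WAIT_READ until the receiver reads the
data.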
Synchronous Protocol
The short protocol described above is used for standard mode sends.
For synchronous mode sends (MPI_Ssend()), the send operation does not
complete until a matching receive operation has started. For short
messages, the data is again sent eagerly, but the send does not
complete until an acknowledgment is received. If a matching buffer is
pre-posted at the receiver, the incoming message is deposited directly
into the user buffer, and a network acknowledgment is sent informing
the sender that the entire buffer was received. At this point, both
the send and the receive have completed. If the incoming message is
unexpected, the message is again deposited into one of the
MPI-supplied unexpected message buffers. A network acknowledgment is
not sent for messages deposited into these unexpected message
buffers. When a matching receive is eventually posted, the receiver
sends an acknowledgment message back to the sender and copies the
message into the user-supplied buffer.
Since the long message protocol meets the semantics of a synchronous
send, the long synchronous protocol is identical to the long protocol.
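The timing of the acknowledgment in the short synchronous case can be
sketched as a tiny state model. The structure and function names are
hypothetical, and the acknowledgment is modeled as a flag rather than
a network message.

```c
#include <assert.h>

typedef struct {
    int ack_received;    /* sender side: the send completes when set */
    int in_unex_buffer;  /* receiver side: message held in an MPI buffer */
    int recv_complete;   /* receiver side: data is in the user buffer */
} short_ssend;

/* The eagerly sent message arrives at the receiver. */
void ssend_arrive(short_ssend *m, int recv_preposted)
{
    if (recv_preposted) {
        m->recv_complete = 1;   /* deposited directly into the user buffer */
        m->ack_received = 1;    /* network acknowledgment sent at once */
    } else {
        m->in_unex_buffer = 1;  /* buffered; no acknowledgment yet */
    }
}

/* A matching receive is posted after the message was buffered. */
void ssend_match_later(short_ssend *m)
{
    assert(m->in_unex_buffer);
    m->ack_received = 1;        /* receiver sends the acknowledgment */
    m->in_unex_buffer = 0;      /* message copied out of the MPI buffer */
    m->recv_complete = 1;
}

/* The synchronous send is complete only once it has been acknowledged. */
int ssend_complete(const short_ssend *m)
{
    return m->ack_received;
}
```

In this model an unexpected message leaves ssend_complete() returning 0
until ssend_match_later() runs, which mirrors the rule that a
synchronous send does not complete before a matching receive has
started.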
Ready Protocol
In the ready send mode (MPI_Rsend()), the application programmer
guarantees to the MPI library that a receive has been pre-posted at
the receiver. For short messages, the protocol used is the same as a
standard mode send. For long messages, the data is sent without first
setting up the buffer to be read by the receiver. That is, the short
ready send and the long ready send both send the data eagerly, and the
send operation is complete as soon as the data has been delivered to
the network. At the receive side, the entire message is deposited
directly into the user buffer. If the message is unexpected, which is
a violation of the standard, the entire message is dropped, and the
library will eventually abort with an internal unrecoverable error.
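The completion rules for the standard, synchronous, and ready modes
described so far can be collected into one lookup. The enum and
function names below are invented for this summary and do not come
from the MPICH source.

```c
#include <assert.h>

typedef enum { MODE_STANDARD, MODE_SYNCHRONOUS, MODE_READY } send_mode;

typedef enum {
    DONE_WHEN_DELIVERED_TO_NETWORK, /* eager send, no acknowledgment needed */
    DONE_WHEN_ACKED,                /* short synchronous: ack once matched */
    DONE_WHEN_ACKED_OR_READ         /* long: ack "received", or the receiver
                                       reads the exposed send buffer */
} completion_rule;

/* When does a send of the given mode and size class complete? */
completion_rule send_completion(send_mode mode, int is_long)
{
    assert(is_long == 0 || is_long == 1);
    switch (mode) {
    case MODE_READY:
        return DONE_WHEN_DELIVERED_TO_NETWORK;
    case MODE_SYNCHRONOUS:
        return is_long ? DONE_WHEN_ACKED_OR_READ : DONE_WHEN_ACKED;
    case MODE_STANDARD:
    default:
        return is_long ? DONE_WHEN_ACKED_OR_READ
                       : DONE_WHEN_DELIVERED_TO_NETWORK;
    }
}

/* Is the send buffer made readable by the receiver before sending?
 * Only long sends do this, and long ready sends skip it. */
int exposes_send_buffer(send_mode mode, int is_long)
{
    assert(is_long == 0 || is_long == 1);
    return is_long && mode != MODE_READY;
}
```

For example, send_completion(MODE_SYNCHRONOUS, 1) and
send_completion(MODE_STANDARD, 1) return the same rule, reflecting that
the long synchronous protocol is identical to the long protocol.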
Buffered Protocol
In addition to standard, synchronous, and ready send modes, the
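standard also defines a buffered mode (MPI_Bsend()). Buffered mode
sends may copy the message into a buffer attached by the application
before it is sent, so they are by nature low performance and should be
avoided.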
Collective Operations
The collective operations are currently implemented using the default
MPICH collective operations, which are built on top of the underlying
MPI point-to-point communications. An optimized collective operation
library built directly on Portals 3.1 is under development.
User-configurable settings
Several of the resources that the MPI library allocates at
initialization time can be configured by the user through environment
variables that are set in the shell in which yod is invoked. The
following describes these values and their effect.
- MPI_UNEX_BLOCK_SIZE
-
The value of this environment variable controls how much space is
allocated to hold unexpected messages. The current strategy uses
three fixed-sized blocks of memory to buffer unexpected messages.
The default value is 2 MB (2097152 bytes), resulting in a total
of 6 MB of space.
- MPI_UNEX_MAX
-
The value of this environment variable controls the maximum number of
unexpected messages (both long and short) that can be outstanding at
any given time. The default value is 2048.
- MPI_IRECV_MAX
-
The value of this environment variable controls the maximum number of
outstanding posted receives that can be in progress at one time. Each
MPI_Irecv() operation consumes a resource and only a finite number of
these resources are available from Portals 3.1. The default value is
1024.
- MPI_LONG_MAX
-
The value of this environment variable controls the maximum number of
outstanding long send operations that can be in progress at one time.
As with posted receives, each outstanding long send operation consumes
a resource. The default value is 1024.
- MPI_LONG_MSG
-
The value of this environment variable controls the definition of
short and long messages. A short message is less than this value,
while a long message is greater than or equal to this value. Setting
this value allows the user to increase the short/long protocol
switchover point. The default value is 8064.
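As an illustration of how the settings above fit together, the
following sketch reads the five variables with getenv() and applies
the documented defaults. The struct, the function names, and the
fallback to the default on a missing or malformed value are
assumptions; the document does not say how the library parses these
variables. The sketch also assumes that MPI_LONG_MSG is a byte count.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

typedef struct {
    long unex_block_size;  /* MPI_UNEX_BLOCK_SIZE, bytes per block */
    long unex_max;         /* MPI_UNEX_MAX */
    long irecv_max;        /* MPI_IRECV_MAX */
    long long_max;         /* MPI_LONG_MAX */
    long long_msg;         /* MPI_LONG_MSG, short/long switchover point */
} mpi_settings;

/* Parse a positive decimal value; fall back to dflt if the text is
 * missing or malformed (an assumption, not documented behavior). */
long parse_setting(const char *text, long dflt)
{
    char *end;
    long v;
    assert(dflt > 0);
    if (text == NULL || *text == '\0')
        return dflt;
    errno = 0;
    v = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || v <= 0)
        return dflt;
    return v;
}

/* Read all five settings from the environment, using the defaults
 * listed in this document. */
mpi_settings load_settings(void)
{
    mpi_settings c;
    c.unex_block_size = parse_setting(getenv("MPI_UNEX_BLOCK_SIZE"), 2097152L);
    c.unex_max        = parse_setting(getenv("MPI_UNEX_MAX"), 2048L);
    c.irecv_max       = parse_setting(getenv("MPI_IRECV_MAX"), 1024L);
    c.long_max        = parse_setting(getenv("MPI_LONG_MAX"), 1024L);
    c.long_msg        = parse_setting(getenv("MPI_LONG_MSG"), 8064L);
    return c;
}

/* A message is long if its size is greater than or equal to MPI_LONG_MSG. */
int is_long_message(const mpi_settings *c, long nbytes)
{
    return nbytes >= c->long_msg;
}

/* Three fixed-size blocks hold unexpected messages. */
long total_unexpected_space(const mpi_settings *c)
{
    return 3 * c->unex_block_size;
}
```

With the defaults, total_unexpected_space() returns 6291456 (6 MB), a
message of 8063 bytes is short, and a message of 8064 bytes is long.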
Debugging
There are also environment variables that can be set to extract
debugging information from MPI programs, provided the MPI library has
been compiled with these debugging options enabled.
- MPI_DEBUG
-
Setting this environment variable will cause all processes in the job
to print low-level MPI implementation information to standard output.
For large jobs or jobs that do a large amount of message passing, this
output can be overwhelming.
- MPI_DUMP_QUEUES
-
Setting this environment variable will cause all processes in the job
to install a signal handler for the SIGUSR2 signal. This signal
handler will dump the current state of the unexpected message queue
from each process to standard output. While a job is running, a
SIGUSR2 signal sent to the yod process will be propagated to each
process in the job. Upon receiving this signal, each process will
print its unexpected queue.
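A generic way to wire up such a dump is sketched below. It is not the
library's implementation: the function names and the representation of
the queue as an array of tags are hypothetical. The sketch also makes
one deliberate design choice that the library may not share: the
handler only sets a flag, and the queue is printed later from normal
code, because stdio functions are not safe to call from inside a
signal handler.

```c
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>

static volatile sig_atomic_t dump_requested = 0;

/* Signal handler: record the request and return immediately. */
static void on_sigusr2(int signo)
{
    (void)signo;
    dump_requested = 1;
}

/* Debugging is enabled when MPI_DUMP_QUEUES is set in the environment. */
int dump_queues_enabled(void)
{
    return getenv("MPI_DUMP_QUEUES") != NULL;
}

/* Install the SIGUSR2 handler. Returns 0 on success, -1 on failure. */
int install_dump_handler(void)
{
    return signal(SIGUSR2, on_sigusr2) == SIG_ERR ? -1 : 0;
}

/* Call from a safe point (for example, a progress loop). Prints the
 * unexpected queue if a dump was requested since the last call.
 * Returns 1 if a dump was printed, 0 otherwise. */
int dump_if_requested(const int *unex_tags, int count, FILE *out)
{
    int i;
    assert(out != NULL && count >= 0);
    if (!dump_requested)
        return 0;
    dump_requested = 0;
    for (i = 0; i < count; i++)
        fprintf(out, "unexpected[%d]: tag=%d\n", i, unex_tags[i]);
    return 1;
}
```

A process would call install_dump_handler() at startup if
dump_queues_enabled() returns nonzero, and then call
dump_if_requested() periodically.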