Purpose
Existing message passing technologies availability for commodity cluster networking hardware do not meet the scalability goals required by the Cplant[1] project at Sandia National Laboratories. The goal of the Cplant project is to construct a commodity cluster that can scale to the order of ten thousand nodes. Their number greatly exceeds the capacity for which message passing technologies have been designed and implemented.
In addition to the scalability requirements of the network, these technologies must also be able to support a scalable implementation of Message Passing Interface (MPI)[7] standard, which has become the de facto standard for parallel scientific computing. While MPI does not impose any scalability limitations, existing message passing technologies do not provide the functionality needed to allow implementations of MPI to meet the scalability requirements of Cplant.
The following are properties of a network architecture that do not
impose any inherent scalability limitations:
- Connectionless - Many connection-oriented architectures, such as VIA[3] and TCP/IP sockets, have limitations on the number of peer connections that can be established.
- Network independence - Many communication systems depend on the host processor to perform operations in order for messages in the network to be consumed. Message consumption from the network should not be dependent on host processor activity, such as the operating system scheduler or user-level thread scheduler.
- User-level flow control - Many communication systems manage flow control internally to avoid depleting resources, which can significantly impact performance as the number of communicating processes increases.
- OS Bypass - High performance network communication should not involve memory copies into or out of a kernel-managed protocol stack.
The following are properties of a network architecture that do not impose scalability limitations for an implementation of MPI:
- Receiver-managed - Sender-managed message passing implementations require a persistent block of memory to be available for every process, requiring memory resources to increase with job size and requiring user-level flow control mechanisms to manage these resources.
- User-level Bypass - While OS Bypass is necessary for high-performance, it alone is not sufficient to support the Progress Rule of MPI asynchronous operations.
- Unexpected messages - Few communication systems have support for receiving messages for which there is no prior notification. Support for these types of messages is necessary to avoid flow control and protocol overhead.