Zoltan will play an important role on CPlant and other heterogeneous computing architectures. Since the characteristics of the processors available to an application are not known until the processors are assigned, traditional static partitioning (as in Chaco/nem_slice) is not sufficient. A dynamic partition that accounts for the obtained processors' computing power, memory, and network connections is needed. Zoltan can be used to compute this dynamic partition.
The heterogeneous computing model we are implementing in Zoltan will enable load-balancing for heterogeneous processors and networks. In the model, a heterogeneous computing system is represented hierarchically, as shown in the figure below. The root of the hierarchy describes the entire computer, including its network topology and speed. Child nodes represent components of the computer. Leaf nodes represent individual processors. In FY99-FY00, we will implement the heterogeneous model, modify existing algorithms to fit the model, and investigate new algorithms that account for processor speeds, network speeds, and available memory.

In the example, a catalytic partial oxidation reactor is modeled. In the figure below, the blue region represents the reactor wall, where only heat transfer is modeled. The yellow region represents the volume of the reactor, where fluid flow, heat transfer, and mass transfer with 22 chemical species and 77 reversible gas-phase reactions (known as the "whole enchilada" in MPSalsa) are modeled. The red region represents a portion of the reactor wall where a catalyst causes surface reactions to take place; fluid flow, heat transfer, mass transfer, and surface reactions are modeled.

The time per finite element node for the matrix fill varies significantly between the three regions. The matrix-fill time per node in the reactor wall is approximately 0.006 seconds. Within the volume of the reactor, it is approximately eight times greater, due to the gas-phase reactions and the greater number of unknowns per node. On the reacting surface, the fill-time per node is roughly two orders of magnitude greater still, due to the non-linear solve required at each reacting surface node. In general, these computation times are not known and, thus, cannot be used by nem_slice, a static partitioning tool used as a pre-processor in MPSalsa. These disparate fill times, however, cause the matrix-fill operation in MPSalsa to be highly imbalanced for a static partition computed by a multi-level graph partitioning algorithm.
We ran MPSalsa on 50 processors of CPlant. During the first matrix fill in MPSalsa, we computed the matrix-fill time for each node. We then called Zoltan to recompute a partition, weighting the nodes by their matrix-fill times. Matrix fills using the new partition required 51% less computation time than those using the original partition. The total load-balancing time in Zoltan was 0.07 seconds. The time for data migration was 0.69 seconds. The sum of these two times is far less that the time saved by using the new partition. For complete results, see the table below.
In MPSalsa, the
implementation of the load-balancing callback functions required by Zoltan
took less than 200 lines of C code, showing that Zoltan is easy to use
by an application. The data migration callback functions required
significantly more code, due to the complexity of the data structures in
MPSalsa. However,
all communication for migrating data (nodes, elements, faces, etc.) between
processors was done by Zoltan's migration-help tools. No additional
communication routines were needed in MPSalsa
to establish the new partition.
Updated: August 29, 1999
Karen Devine
kddevin@cs.sandia.gov