Karen D. Devine, Gary L. Hennigan*, Scott A. Hutchinson, Andrew G. Salinger, John N. Shadid, and Ray S. Tuminaro
jnshadi@cs.sandia.gov
http://www.cs.sandia.gov/CRF/mpsalsa.html
Current state-of-the-art chemically reacting flow codes use either complex fluid dynamics with simple chemical reaction mechanisms or complex chemical kinetics with simple fluid mechanics models. This unfortunate dichotomy in resolution is due to the tremendous computational resources needed to solve large, real-world chemically reacting flow problems in complex flow geometries. As a result, 3D computational simulations are very difficult, and computational design optimization of such systems has been impossible. In a typical reaction mechanism, there can be over thirty important chemical species undergoing more than fifty reactions. Our results demonstrate that large parallel machines can provide the performance and memory necessary to solve such chemically reacting flow problems in 3D with equal emphasis on flow and reaction kinetics. However, achieving this performance requires addressing specific difficulties associated with unstructured finite element (FE) implementations on distributed memory computers. To date, MPSalsa [9] has been used for production computational analysis of chemical vapor deposition (CVD) reactors, multi-phase porous media flow, low Mach number compressible neutral-gas simulations for an ion source tube, and simulation of alpha particle detectors for nuclear nonproliferation studies.
The Sandia-Intel TFlop computer is a general purpose massively parallel processor (MPP) with 3600 (eventually 4500) shared memory compute nodes, where each node comprises dual 200 MHz Intel Pentium Pro processors and 128 MBytes of memory. This machine set the MP-Linpack world record of 1.05 Tflop/s in December 1996. In contrast, MPSalsa is a real, complex, production scientific code with a total of over 150,000 lines of C source code and a User's Guide [7]. Achieving high performance with such a large application code on an MPP is very challenging. In this brief summary we discuss two such challenges: the single-CPU performance of the underlying kernel algorithms and the use of multiple shared memory CPUs at the nodes of the parallel machine. In particular, the choice of data structures that make efficient use of the memory hierarchy and sustain high data throughput throughout the code is critical to overall performance. The ideal structure combines efficient memory usage with high computational throughput and allows implementation of robust solution methods. These issues have been addressed in MPSalsa, enabling unprecedented performance in the solution of implicit unstructured FE problems.
MPSalsa computes the solution of the conservation equations for momentum, total mass, thermal energy, and individual gas and surface phase chemical species for low Mach number flows. As Table 1 shows, these equations form a complex set of coupled, nonlinear PDEs. Necessary transport properties, diffusion coefficients, kinetic rate constants and diffusion velocities are obtained from the CHEMKIN [5, 6] library.
Table 1: Governing conservation equations: momentum; total mass; thermal energy; species mass fraction for each species k. [Equations not recoverable in this extraction.]
The continuous problem defined by the governing conservation equations is spatially approximated by a Petrov-Galerkin finite element method. For steady-state simulations, the nonlinear PDEs are solved using an inexact Newton method with back-tracking [11]. For transient simulations, the spatial approximation is coupled with first- and second-order dynamically controlled time stepping methods. At each time step, an implicit solution of the nonlinear PDEs is obtained by the inexact Newton method. In both cases, the resulting linear systems are solved iteratively using preconditioned Krylov techniques as implemented in the Aztec library [4].
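The inexact Newton strategy with backtracking can be sketched on a scalar model problem: take the full Newton step, then halve the damping factor until the residual norm decreases. This is only an illustrative sketch of the idea, not MPSalsa's implementation; the function names and the model equation f(x) = x^3 - 2 are invented for the example.

```c
#include <math.h>

/* Model problem: solve f(x) = x^3 - 2 = 0. */
static double f(double x)  { return x * x * x - 2.0; }
static double fp(double x) { return 3.0 * x * x; }

/* Damped (backtracking) Newton iteration: accept the step only once it
 * reduces the residual norm, halving the damping factor otherwise. */
double newton_backtrack(double x, int max_iter, double tol)
{
    for (int it = 0; it < max_iter; ++it) {
        double r = f(x);
        if (fabs(r) < tol)
            break;
        double dx = -r / fp(x);          /* full Newton step            */
        double lambda = 1.0;             /* backtracking damping factor */
        while (fabs(f(x + lambda * dx)) >= fabs(r) && lambda > 1e-10)
            lambda *= 0.5;               /* halve until residual drops  */
        x += lambda * dx;
    }
    return x;
}
```

In the PDE setting the Newton step itself is computed only inexactly, by the preconditioned Krylov solve described below.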
MPSalsa stores the fully summed equations in a sparse matrix format that permits a linear solve with a minimal number of floating-point operations. This approach sums the elemental coefficient matrices once, at the beginning of the linear solve, rather than re-summing them in each matrix-vector product, and hence eliminates redundant computations. However, while efficient in both memory usage and performance during the linear solution phase of MPSalsa, the indirect addressing associated with summing the element integration values directly into this data structure is costly. Thus, we have implemented element-based data structures for use during the matrix-fill algorithm. These data structures are contiguous in memory and can be addressed directly. All of an element's contributions to the Jacobian matrix are stored in the element-based data structure as they are computed, and are accumulated into the sparse, unstructured Jacobian matrix at the end of the computations for the element. In this way, MPSalsa fills the sparse, unstructured Jacobian matrix at a high rate. These data structures also make use of the second processor on each node during key kernels of the matrix-fill algorithm.
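The two-stage fill can be sketched as follows: an element's contributions are computed into a small, contiguous dense array, then scattered once into the global sparse matrix. The CSR layout and the `scatter_element`/`demo_entry` routines below are illustrative stand-ins, assuming a point (non-block) matrix for simplicity; MPSalsa's actual structures are block-based.

```c
/* Toy CSR (compressed sparse row) matrix standing in for the global,
 * fully summed Jacobian. */
typedef struct {
    int     n;        /* number of rows   */
    int    *rowptr;   /* CSR row pointers */
    int    *colind;   /* column indices   */
    double *val;      /* nonzero values   */
} csr_t;

/* Scatter a dense nelem x nelem element matrix `ke` into `A` at the
 * global node numbers `nodes[]`.  The linear search of each row's
 * sparsity pattern is where the costly indirection lives; doing it once
 * per element, after the dense fill, is the point of the design. */
void scatter_element(csr_t *A, const int *nodes, const double *ke, int nelem)
{
    for (int i = 0; i < nelem; ++i) {
        int row = nodes[i];
        for (int j = 0; j < nelem; ++j) {
            int col = nodes[j];
            for (int k = A->rowptr[row]; k < A->rowptr[row + 1]; ++k)
                if (A->colind[k] == col) {
                    A->val[k] += ke[i * nelem + j];
                    break;
                }
        }
    }
}

/* Tiny demo: two elements sharing the same two nodes; their dense
 * contributions sum into the global matrix. */
double demo_entry(void)
{
    int rowptr[3]  = {0, 2, 4};
    int colind[4]  = {0, 1, 0, 1};
    double val[4]  = {0.0, 0.0, 0.0, 0.0};
    csr_t A = {2, rowptr, colind, val};
    int nodes[2]   = {0, 1};
    double ke[4]   = {1.0, -1.0, -1.0, 1.0};
    scatter_element(&A, nodes, ke, 2);
    scatter_element(&A, nodes, ke, 2);
    return val[0];   /* A(0,0) after summing both elements */
}
```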
The use of the fully summed Jacobian matrix allows the design of more complex and robust preconditioning methods. The matrix is stored in a variable block row (VBR) format [1]. In this approach the matrix has a sparse block structure due to the coupling of the physics at each FE node. The blocks are stored contiguously in memory, so the indirect addressing of other sparse matrix formats is replaced with directly addressed dense matrix-vector multiplications, e.g., level-2 BLAS operations.
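A minimal sketch of why the block format pays off: within each dense block the multiply is directly addressed, so an indirect lookup (the block-column index) is amortized over b*b values instead of one. The routines below assume a simplified fixed-block variant (true VBR also supports variable block sizes) and are not Aztec's code.

```c
/* y += A * x for one dense b x b block; row-major A.  This inner loop is
 * directly addressed, in the spirit of a level-2 BLAS dgemv. */
void block_gemv(int b, const double *A, const double *x, double *y)
{
    for (int i = 0; i < b; ++i) {
        double sum = 0.0;
        for (int j = 0; j < b; ++j)
            sum += A[i * b + j] * x[j];
        y[i] += sum;
    }
}

/* Block-sparse matvec: one indirect block-index load (bind[k]) per
 * 2*b*b flops, versus one index load per 2 flops in a point format. */
void vbr_matvec(int nbrows, int b,
                const int *bptr, const int *bind, const double *val,
                const double *x, double *y)
{
    for (int I = 0; I < nbrows; ++I)
        for (int k = bptr[I]; k < bptr[I + 1]; ++k)
            block_gemv(b, &val[k * b * b], &x[bind[k] * b], &y[I * b]);
}

/* Demo: one block row, one 2x2 block A = [[1,2],[3,4]], x = [1,1]. */
double vbr_demo(void)
{
    int bptr[2]   = {0, 1};
    int bind[1]   = {0};
    double val[4] = {1.0, 2.0, 3.0, 4.0};
    double x[2]   = {1.0, 1.0};
    double y[2]   = {0.0, 0.0};
    vbr_matvec(1, 2, bptr, bind, val, x, y);
    return y[0] + y[1];
}
```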
The parallel Krylov algorithms implemented in Aztec include CG, CGS, GMRES, BiCGSTAB and TFQMR. The available preconditioners are row-sum and block Jacobi scaling, block Jacobi preconditioning, Neumann series and least-squares polynomial methods, and many additive Schwarz domain decomposition preconditioners using various ILU factorizations with variable overlap. The main kernels of the iterative methods are matrix-vector products, DAXPY-type operations and vector inner products. The key to performance in these solvers is the efficiency of the matrix-vector multiplication kernel, where interprocessor communication time must be minimized. In forming the sparse matrix and vectors, each processor is assigned a set of unknowns corresponding to a set of rows in the sparse matrix and associated vectors [8]. This set is further subdivided into border and internal unknowns. Border unknowns are those whose rows reference values owned by neighboring processors, so communication is required to complete their part of the matrix-vector multiplication; the remaining unknowns owned by a processor are designated internal. Unknowns required for a processor's computations but assigned to a neighboring processor are designated external. This division of unknowns into internal, border and external sets allows computation and communication to be overlapped: since calculations on the internal unknowns require no off-processor values, they can proceed simultaneously with communication.
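The internal/border classification described above can be sketched directly from the sparsity pattern: a locally owned row is a border row if any of its column indices is owned by another processor, and internal otherwise. The `owner[]` array below is an illustrative stand-in for the ownership information derived from the mesh partition; the routine is not MPSalsa's code.

```c
/* Return 1 if `row` is internal for processor `me` (touches only locally
 * owned unknowns), 0 if it is a border row (touches an external unknown),
 * given a CSR sparsity pattern and a global owner[] map. */
int row_is_internal(int row, int me, const int *owner,
                    const int *rowptr, const int *colind)
{
    for (int k = rowptr[row]; k < rowptr[row + 1]; ++k)
        if (owner[colind[k]] != me)
            return 0;   /* references an external unknown -> border */
    return 1;
}

/* Demo: three unknowns; processor 0 owns rows 0 and 1, processor 1 owns
 * unknown 2.  Row 0 couples only {0,1} (internal); row 1 couples {1,2}
 * (border).  Returns 10*internal(row0) + internal(row1). */
int classify_demo(void)
{
    int rowptr[4] = {0, 2, 4, 6};
    int colind[6] = {0, 1,  1, 2,  0, 2};
    int owner[3]  = {0, 0, 1};
    return row_is_internal(0, 0, owner, rowptr, colind) * 10
         + row_is_internal(1, 0, owner, rowptr, colind);
}
```

In the overlapped matrix-vector product, a processor initiates the exchange of external values, computes the internal rows while messages are in flight, and then completes the border rows.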
A preprocessing step in many parallel simulations is partitioning the domain into subdomains and assigning the subdomains to processors. The partitions used for the results presented in Section 4 were generated using Chaco [3], a general graph (or mesh) partitioning tool which supports a variety of new and established graph partitioning heuristics. These include spectral techniques, geometric methods, multilevel algorithms and the Kernighan-Lin method. Using these techniques, a problem mapping was constructed with low communication volume, good load balance, few message start-ups and little congestion, which also facilitated the overlapping of communication and computation.
Our discussion of performance starts with the effect of the VBR data structure block size on performance of the iterative solver kernel on a single node. In MPSalsa, the VBR block size is determined from the total number of unknown quantities solved for at each FE node. For 3D fluid flow, energy and mass transfer, there are three fluid velocities u_{i}, the hydrodynamic pressure P, the temperature T, and N_{s} gas-phase species mass fractions Y_{k}, totaling (5 + N_{s}) unknowns at each FE node. In Figure 1, we show the increase in performance for the TFQMR iterative solver with increasing block size. This increase in performance with VBR block size is due to a reduction in the number of indirect addressing operations per floating point operation in the matrix-vector multiplication step.
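The trend in Figure 1 follows from a back-of-the-envelope count: a point (1x1) sparse matvec performs roughly 2 flops per indirect column-index load, while a b x b VBR block performs 2*b*b flops per block-index load. The helper names below are invented for this illustration.

```c
/* VBR block size for a 3D flow/energy/mass-transfer problem:
 * 3 velocities + pressure + temperature + N_s species mass fractions. */
int vbr_block_size(int n_species) { return 5 + n_species; }

/* Approximate flops performed per indirect block-index load in the
 * block matrix-vector product (2*b*b multiply-adds per block). */
int flops_per_index(int b) { return 2 * b * b; }
```

For example, the 19-species GaAs mechanism discussed below yields a block size of 24, i.e., over a thousand flops per indirect load, versus 2 for a point format.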
To demonstrate the breadth of MPSalsa's robust and efficient algorithms, we show in Table 2 the time to solution for a steady state calculation of a one-million-element tilted CVD reactor with more than eight million unknowns. The reactor configuration is presented in Figure 2, along with the surface elements of a simplified 43,568 element mesh. The reactants enter the reactor through the rectangular region on the left. The bottom of the reactor slopes up at a nine-degree angle so that the outlet of the reactor has only one quarter the area of the inlet face. The rectangular susceptor has a disk in its middle that spins to produce more uniform deposition. The tilted base of the reactor is designed to offset the decline in deposition rate down the length of the reactor due to reactant consumption by accelerating the flow and shrinking the boundary layer over the disk, thereby decreasing mass transfer resistance to the surface. The robust methods in MPSalsa allow direct calculation of the steady state without the need to time step to the solution. A solution for this reactor is shown in Figure 3. MPSalsa has also been used to computationally optimize this reactor using a conjugate gradient optimization routine, entailing 100 solutions [2].
# TFlop Nodes | Mesh Load Time (s) | Output Time (s) | Newton Steps | Aztec Iterations | Matrix Fill Time (s) | Solver Time (s) | Overall Time (s)
1000 | 260 | 44 | 17 | 9834 | 360.0 | 7605.0 | 8269.0
MPSalsa has also been used to simulate a vertical CVD reactor used for Gallium Arsenide (GaAs) deposition. The reactor geometry consists of a vertical cylinder sitting concentrically inside a larger cylindrical reaction vessel. Flow enters uniformly through the circular cross-section of the reactor. The inner cylinder is rotated with reaction occurring on its top heated surface. Flow exits through the annular region between the cylinders. Very uniform growth has been observed in this reactor over a large central section of the disk where the effects of a finite radius system are small. Figure 4 shows a coarse grid steady state solution of a rotating disk reactor and Figure 5 shows how these solutions match experimental results.
For these reactors, experimentalists have also seen transient behavior, which is much more computationally demanding to simulate. Visualizations of transient simulations are shown in Figure 6 and Figure 7. Table 3 shows the scaling performance of MPSalsa for such a simulation. A mechanism with 19 species yields a block size of 24. A half-million node mesh was used on 900 nodes of the Sandia-Intel TFlop computer, while a two million node mesh was used on the full 3600 nodes, attaining a 97.7% relative scaled efficiency. Clearly, both the matrix fill and iterative solver are highly scalable. In addition, use of the second processor on each compute node reduced the matrix fill time by 22% and the solve time by 12%.
Table 4 shows the performance of MPSalsa on a transient run using the TFQMR solver, which balances robustness with memory efficiency. As discussed earlier, the solver speed increases with block size, reaching a peak rate of 226 Gflop/s. The 8x8 block size run used a simple Gallium Arsenide mechanism, which attained a very fast fill rate, but the solver rate suffered due to the small block size. The Gallium Arsenide deposition problem with the full 19 species mechanism, run on a 2 million node mesh, attained a sustained rate of 211 Gflop/s over the entire code: matrix fill, linear solver, and time to output the solution. The use of the dual processor gained approximately 15% overall. The 36x36 block size includes 31 species for the combustion of methane, and several hundred reactions.
Mesh Size | # Unknowns | # TFlop Nodes | # CPUs per Node | Fill Time (s) | Fill Gflop/s | Solver Time (s) | Solver Gflop/s | Overall Time (s) | Overall Gflop/s | Scaled Efficiency
2.0x10^{6} | 4.8x10^{7} | 3600 | 2 | 36.6 | 110.8 | 236.6 | 205.5 | 273.2 | 192.4 | 97.7%
0.5x10^{6} | 1.2x10^{7} | 900 | 2 | 36.1 | 27.3 | 229.2 | 52.6 | 265.3 | 49.2 | --
0.5x10^{6} | 1.2x10^{7} | 900 | 1 | 44.1 | 22.3 | 251.0 | 48.0 | 295.1 | 44.2 | --
Mesh Size | Block Size | # Unknowns | I/O Time (s, amortized over 5 time steps) | # CPUs per Node | Fill Time (s) | Fill Gflop/s | Solver Time (s) | Solver Gflop/s | Overall Time (s) | Overall Gflop/s
2.0x10^{6} | 8x8 | 1.6x10^{7} | 0.9 | 2 | 3.4 | 122.6 | 66.2 | 130.2 | 70.5 | 128.1
2.0x10^{6} | 8x8 | 1.6x10^{7} | 0.9 | 1 | 4.2 | 99.0 | 67.5 | 127.0 | 72.6 | 123.8
2.0x10^{6} | 24x24 | 4.8x10^{7} | 2.6 | 2 | 36.6 | 110.8 | 343.9 | 224.0 | 383.1 | 211.7
2.0x10^{6} | 24x24 | 4.8x10^{7} | 2.6 | 1 | 45.0 | 87.5 | 392.0 | 196.3 | 439.6 | 183.9
0.75x10^{6} | 36x36 | 2.4x10^{7} | 1.5 | 2 | 91.1 | ~88.7 | 280.8 | 226.1 | 373.4 | ~191.6
0.75x10^{6} | 36x36 | 2.4x10^{7} | 1.5 | 1 | 98.5 | ~82.0 | 314.7 | 201.8 | 414.4 | ~172.8
We have developed a parallel code to simulate chemically reacting flows in complex 3D geometries, with particular emphasis on systems such as chemical vapor deposition reactors. The code solves coupled 3D fluid flow, heat transfer, mass transfer and nonequilibrium chemical reaction equations on unstructured grids. Through a combination of innovative algorithms and data structures, this production scientific code has sustained over 210 Gflop/s on 3600 nodes of the Sandia-Intel TFlop computer while simulating a problem of great interest to Sandia and the semiconductor industry.
The authors gratefully acknowledge Greg Henry at Intel for providing key assembly-coded routines.