High Performance MP Unstructured Finite Element Simulation of Chemically Reacting Flows

Karen D. Devine, Gary L. Hennigan*, Scott A. Hutchinson, Andrew G. Salinger, John N. Shadid, and Ray S. Tuminaro
Sandia National Laboratories, Albuquerque, NM
*New Mexico State University, Las Cruces, NM
jnshadi@cs.sandia.gov
http://www.cs.sandia.gov/CRF/mpsalsa.html

Abstract:
Many scientific and engineering applications require a detailed analysis of complex systems with strongly coupled fluid flow, thermal energy transfer, mass transfer and non-equilibrium chemical reactions. Examples include combustion research for transportation and energy conversion systems, surface catalytic reactors and the modeling of chemical vapor deposition (CVD) processing for advanced semiconductor materials. Here we describe the performance of MPSalsa, a modeling code designed to simulate these complex flows on large-scale parallel machines. MPSalsa uses 3D unstructured finite element methods, fully implicit time integration and general gas-phase and surface-species chemical kinetics to model geometrically complex flow systems. The implicit nature of the algorithm requires the solution of a coupled set of nonlinear PDEs on complex domains, a difficult task on the distributed-memory architectures of modern large-scale parallel machines. To address this difficulty, we have designed MPSalsa around general kernels for automated problem partitioning, efficient unstructured message-passing communication, a distributed sparse-block matrix representation of the fully summed global finite element equations, and a parallel preconditioned Krylov iterative solver library (Aztec). Using these techniques, we have obtained a sustained rate of over 210 Gflop/s on a 3D chemically reacting flow problem.

Keywords:
chemically reacting flow, unstructured finite element methods, parallel computing

1. Introduction

Current state-of-the-art chemically reacting flow codes use either complex fluid dynamics with simple chemical reaction mechanisms or complex chemical kinetics with simple fluid mechanics models. This unfortunate dichotomy in resolution is due to the tremendous computational resources needed to solve large, real-world chemically reacting flow problems in complex flow geometries. As a result, 3D computational simulation of such systems has been very difficult and their computational design optimization has been impossible. In a typical reaction mechanism, there can be over thirty important chemical species undergoing more than fifty reactions. Our results demonstrate that large parallel machines can provide the performance and memory necessary to solve such chemically reacting flow problems in 3D with equal emphasis on flow and reaction kinetics. However, achieving this performance requires addressing specific difficulties associated with unstructured finite element (FE) implementations on distributed-memory computers. To date, MPSalsa [9] has been used for production computational analysis of CVD reactors, multi-phase porous media flow, low Mach number compressible neutral-gas flow in an ion source tube, and alpha particle detectors for nuclear nonproliferation studies.

The Sandia-Intel TFlop computer is a general-purpose massively parallel processor (MPP) with 3600 (eventually 4500) shared-memory compute nodes, each comprising dual 200 MHz Intel Pentium Pro processors and 128 MBytes of memory. This machine set the MP-Linpack world record of 1.05 Tflop/s in December 1996. In contrast, MPSalsa is a real, complex, production scientific code totaling over 150,000 lines of C source code, with a User's Guide [7]. Achieving high performance with such a large application code on an MPP is very challenging. In this brief summary we discuss two such challenges: the single-CPU performance of the underlying kernel algorithms and the use of multiple shared-memory CPUs on the nodes of the parallel machine. In particular, the choice of data structures that make efficient use of the memory hierarchy and sustain high data throughput throughout the code is critical to overall performance. The ideal structure combines efficient memory usage with high computational throughput and allows implementation of robust solution methods. These issues have been addressed in MPSalsa, enabling unprecedented performance in the solution of implicit unstructured FE problems.

2. Numerical Methods

MPSalsa computes the solution of the conservation equations for momentum, total mass, thermal energy, and individual gas and surface phase chemical species for low Mach number flows. As Table 1 shows, these equations form a complex set of coupled, nonlinear PDEs. Necessary transport properties, diffusion coefficients, kinetic rate constants and diffusion velocities are obtained from the CHEMKIN [5, 6] library.

Table 1. Governing Conservation Equations: momentum, total mass, thermal energy, and mass fraction of each chemical species k.

The continuous problem defined by the governing conservation equations is spatially approximated by a Petrov-Galerkin finite element method. For steady-state simulations, the nonlinear PDEs are solved using an inexact Newton method with back-tracking [11]. For transient simulations, the spatial approximation is coupled with first- and second-order dynamically controlled time stepping methods. At each time step, an implicit solution of the nonlinear PDEs is obtained by the inexact Newton method. In both cases, the resulting linear systems are solved iteratively using preconditioned Krylov techniques as implemented in the Aztec library [4].
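
To make the nonlinear solution strategy concrete, the inexact Newton condition can be stated as follows (a standard formulation; the specific forcing-term and backtracking strategies are those described in [11]). At iterate x_k, with discrete nonlinear residual F and Jacobian J, the Krylov solver is required to produce a step s_k satisfying only

    \| F(x_k) + J(x_k) s_k \| \le \eta_k \| F(x_k) \|,      x_{k+1} = x_k + \lambda_k s_k,

where the forcing term \eta_k controls how accurately each linear system is solved and \lambda_k \in (0, 1] is the backtracking (line-search) damping parameter.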

3. Parallel Implementation

3.1. Efficient Data Structures

MPSalsa stores the fully summed equations in a sparse matrix format that minimizes the number of floating-point operations in the linear solve. Summing the elemental coefficient matrices once, at the beginning of the linear solve, rather than accumulating element contributions during each matrix-vector product, eliminates redundant computation. However, while this format is efficient in both memory usage and performance during the linear solution phase, the indirect addressing required to sum the element integration values directly into it is costly. We therefore use element-based data structures during the matrix-fill algorithm. These data structures are contiguous in memory and can be addressed directly. All of an element's contributions to the Jacobian matrix are stored in the element-based data structure as they are computed and are accumulated into the sparse, unstructured Jacobian matrix at the end of the computations for that element. In this way, MPSalsa fills the sparse, unstructured Jacobian matrix at a high performance rate. These data structures also make use of the second processor on each node during key kernels of the matrix-fill algorithm.
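
The following C fragment sketches this two-level fill strategy. It is a minimal illustration under assumed names and interfaces (NODES_PER_ELEM, compute_element_integrals, vbr_add_block and the element-matrix layout are hypothetical, not MPSalsa's actual data structures):

    /* Illustrative sketch only: each element's Jacobian contributions are
     * accumulated in a small, contiguous, directly addressed buffer, then
     * scattered into the global sparse block matrix once per element.        */

    #define NODES_PER_ELEM 8           /* e.g., a trilinear hexahedral element */

    struct vbr_matrix;                 /* global sparse block matrix (layout sketched below) */

    /* Assumed helpers: quadrature-loop element integration, and a routine that
     * adds an nunk x nunk dense block into block row/column (I, J) of A.      */
    void compute_element_integrals(int elem, int nunk, double *elem_mat);
    void vbr_add_block(struct vbr_matrix *A, int I, int J,
                       const double *block, int row_stride, int nunk);

    void fill_one_element(int elem, int nunk, const int *elem_nodes,
                          double *elem_mat, struct vbr_matrix *A)
    {
      int n = NODES_PER_ELEM * nunk;   /* dense element matrix is n x n (row-major) */

      /* 1. Cache-friendly accumulation into the contiguous element buffer.    */
      for (int i = 0; i < n * n; i++) elem_mat[i] = 0.0;
      compute_element_integrals(elem, nunk, elem_mat);

      /* 2. One indirect scatter per element: add each nodal sub-block into the
       *    corresponding block of the fully summed global matrix.             */
      for (int a = 0; a < NODES_PER_ELEM; a++)
        for (int b = 0; b < NODES_PER_ELEM; b++)
          vbr_add_block(A, elem_nodes[a], elem_nodes[b],
                        &elem_mat[a * nunk * n + b * nunk], n, nunk);
    }

The point of the design is that the innermost integration and assembly loops touch only the contiguous element buffer; the indirect addressing into the global sparse matrix happens once per element rather than once per quadrature point.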

The use of the fully summed Jacobian matrix allows the design of more complex and robust preconditioning methods. The matrix is stored in a variable block row (VBR) format [1]. The matrix has a sparse block structure because of the coupling of the physics at each FE node; these blocks are stored contiguously in memory, so the indirect addressing of other sparse matrix formats is replaced with directly addressed dense matrix-vector multiplications, e.g., those in the level-2 BLAS.
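
A minimal sketch of the VBR layout and its matrix-vector product is shown below. The array roles follow the sparse BLAS proposal [1], but the struct and function are illustrative rather than MPSalsa's or Aztec's actual code; blocks are assumed to be stored column-major within val.

    /* Variable block row (VBR) storage (illustrative sketch). */
    struct vbr_matrix {
      double *val;   /* nonzero blocks, stored contiguously (column-major per block) */
      int *indx;     /* indx[k]  = offset in val of the k-th stored block             */
      int *bindx;    /* bindx[k] = block-column index of the k-th stored block        */
      int *rpntr;    /* rpntr[i] = first point row of block row i; rpntr[nbrows] = N  */
      int *cpntr;    /* cpntr[j] = first point column of block column j               */
      int *bpntr;    /* bpntr[i] = first entry in bindx/indx for block row i          */
      int nbrows;    /* number of block rows                                          */
    };

    /* y = A*x: the inner loops are directly addressed dense matrix-vector
     * products on each block, the operations that map onto level-2 BLAS.      */
    void vbr_matvec(const struct vbr_matrix *A, const double *x, double *y)
    {
      for (int i = 0; i < A->nbrows; i++) {
        int r0 = A->rpntr[i], m = A->rpntr[i + 1] - r0;      /* rows in block row i   */
        for (int r = 0; r < m; r++) y[r0 + r] = 0.0;
        for (int k = A->bpntr[i]; k < A->bpntr[i + 1]; k++) {
          int j  = A->bindx[k];
          int c0 = A->cpntr[j], nc = A->cpntr[j + 1] - c0;   /* columns in this block */
          const double *blk = &A->val[A->indx[k]];           /* m x nc, column-major  */
          for (int c = 0; c < nc; c++)
            for (int r = 0; r < m; r++)
              y[r0 + r] += blk[c * m + r] * x[c0 + c];
        }
      }
    }

Because each block is dense and directly addressed, the two inner loops can be replaced by a level-2 BLAS call (DGEMV-style), which is how the indirect-addressing cost is amortized over larger dense operations as the block size grows.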

3.2. Krylov Iterative Solvers

The parallel Krylov algorithms implemented in Aztec include CG, CGS, GMRES, BiCGSTAB and TFQMR. The available preconditioners are row-sum and block Jacobi scaling, block Jacobi preconditioning, Neumann series and least-squares polynomial methods, and many additive Schwarz domain decomposition preconditioners using various ILU factorizations with variable overlap. The main kernels of the iterative methods are matrix-vector products, DAXPY-type operations and vector inner products. The key to performance in these solvers is the efficiency of the matrix-vector multiplication kernel, where interprocessor communication time must be minimized. In forming the sparse matrix and vectors, each processor is assigned a set of unknowns corresponding to a set of rows in the sparse matrix and associated vectors [8]. This set is further subdivided into internal and border unknowns. Border unknowns are those that require communication with neighboring processors to complete the matrix-vector multiplication; the remaining unknowns owned by a processor are designated internal. Unknowns required for a processor's computations but assigned to a neighboring processor are designated external. This division of unknowns into internal, border and external sets allows computation and communication to be overlapped: since calculations on the internal unknowns require no updated values from other processors, they can proceed simultaneously with communication.
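
The sketch below illustrates this overlap of communication and computation for the matrix-vector product, using MPI non-blocking sends and receives. The comm_plan structure, the helper routines and the assumed local ordering of unknowns ([ internal | border | external ]) are illustrative assumptions and not Aztec's actual interface.

    #include <mpi.h>

    #define MAX_NEIGHBORS 64                     /* illustrative bound */

    struct vbr_matrix;                           /* sparse block matrix (Section 3.1) */

    /* Illustrative communication plan: which processors neighbor this one, what
     * to send them (border values) and where their externals land locally.    */
    struct comm_plan {
      int num_neighbors;
      int num_internal_rows, num_border_rows;    /* locally owned block rows    */
      int neighbor[MAX_NEIGHBORS];               /* neighbor processor ranks    */
      int recv_offset[MAX_NEIGHBORS];            /* start of each neighbor's externals in x */
      int recv_len[MAX_NEIGHBORS], send_len[MAX_NEIGHBORS];
      int    *send_list[MAX_NEIGHBORS];          /* local indices of border values to send */
      double *send_buf[MAX_NEIGHBORS];
    };

    /* Assumed helpers: pack listed entries of x into a buffer, and apply block
     * rows [first, last) of A to x.                                            */
    void gather(const int *list, int len, const double *x, double *buf);
    void vbr_matvec_rows(const struct vbr_matrix *A, const double *x, double *y,
                         int first_block_row, int last_block_row);

    /* Overlapped y = A*x (sketch, not Aztec's actual code). */
    void overlapped_matvec(const struct vbr_matrix *A, double *x, double *y,
                           const struct comm_plan *plan)
    {
      MPI_Request reqs[2 * MAX_NEIGHBORS];
      int nreq = 0;

      /* 1. Post non-blocking receives for external values owned by neighbors. */
      for (int p = 0; p < plan->num_neighbors; p++)
        MPI_Irecv(&x[plan->recv_offset[p]], plan->recv_len[p], MPI_DOUBLE,
                  plan->neighbor[p], 0, MPI_COMM_WORLD, &reqs[nreq++]);

      /* 2. Send this processor's border values to each neighbor.              */
      for (int p = 0; p < plan->num_neighbors; p++) {
        gather(plan->send_list[p], plan->send_len[p], x, plan->send_buf[p]);
        MPI_Isend(plan->send_buf[p], plan->send_len[p], MPI_DOUBLE,
                  plan->neighbor[p], 0, MPI_COMM_WORLD, &reqs[nreq++]);
      }

      /* 3. Overlap: internal rows need no external data, so compute them while
       *    the messages are in flight.                                         */
      vbr_matvec_rows(A, x, y, 0, plan->num_internal_rows);

      /* 4. Wait for the externals, then finish the border rows.               */
      MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
      vbr_matvec_rows(A, x, y, plan->num_internal_rows,
                      plan->num_internal_rows + plan->num_border_rows);
    }

The only synchronization point is the MPI_Waitall before the border rows are processed, so the cost of exchanging border values is hidden behind the computation on internal rows whenever that computation is large enough.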

3.3. Partitioning

A preprocessing step in many parallel simulations is partitioning the domain into subdomains and assigning the subdomains to processors. The partitions used for the results presented in Section 4 were generated using Chaco [3], a general graph (or mesh) partitioning tool that supports a variety of new and established graph partitioning heuristics, including spectral techniques, geometric methods, multilevel algorithms and the Kernighan-Lin method. Using these techniques, a problem mapping was constructed with low communication volume, good load balance, few message start-ups and little congestion, which also facilitated the overlapping of communication and computation.

4. Results

Our discussion of performance starts with the effect of the VBR block size on the performance of the iterative solver kernel on a single node. In MPSalsa, the VBR block size is determined by the total number of unknowns solved for at each FE node. For 3D fluid flow, energy and mass transfer, there are three fluid velocities u_i, the hydrodynamic pressure P, the temperature T, and N_s gas-phase species mass fractions Y_k, totaling (5 + N_s) unknowns at each FE node. Figure 1 shows the increase in performance of the TFQMR iterative solver with increasing block size. This improvement is due to a reduction in the number of indirect addressing operations per floating-point operation in the matrix-vector multiplication step.
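
For example, using only the mechanism sizes quoted later in this section, the block dimensions in Tables 3 and 4 follow directly from this count:

    5 + N_s = 5 + 19 = 24   (GaAs deposition mechanism)
    5 + N_s = 5 + 31 = 36   (methane combustion mechanism)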

Figure 1. Single node performance as a function of block size.

To demonstrate the breadth of MPSalsa's robust and efficient algorithms, we show in Table 2 the time to solution for a steady state calculation of a one-million-element tilted CVD reactor with more than eight million unknowns. The reactor configuration is presented in Figure 2, along with the surface elements of a simplified 43,568 element mesh. The reactants enter the reactor through the rectangular region on the left. The bottom of the reactor slopes up at a nine-degree angle so that the outlet of the reactor has only one quarter the area of the inlet face. The rectangular susceptor has a disk in its middle that spins to produce more uniform deposition. The tilted base of the reactor is designed to offset the decline in deposition rate down the length of the reactor due to reactant consumption by accelerating the flow and shrinking the boundary layer over the disk, thereby decreasing mass transfer resistance to the surface. The robust methods in MPSalsa allow direct calculation of the steady state without the need to time step to the solution. A solution for this reactor is shown in Figure 3. MPSalsa has also been used to computationally optimize this reactor using a conjugate gradient optimization routine, entailing 100 solutions [2].

# TFlop nodes:        1000
Time to load mesh:     260
Output time:            44
Newton steps:           17
Aztec iterations:     9834
Matrix fill time:    360.0
Solver time:        7605.0
Overall time:       8269.0

Table 2. Time to Steady Solution for Tilted CVD GaAs Reactor (1.0x10^6 elements, 8,360,664 unknowns), using GMRES(150) with ILUT(1.5, 0.0) preconditioning. Times are in seconds.

Figure 2. Simplified mesh for the tilted CVD reactor.

Figure 3. Visualization of the steady-state solution of Gallium Arsenide deposition.

MPSalsa has also been used to simulate a vertical CVD reactor for Gallium Arsenide (GaAs) deposition. The reactor geometry consists of a vertical cylinder sitting concentrically inside a larger cylindrical reaction vessel. Flow enters uniformly through the circular cross-section of the reactor. The inner cylinder rotates, with the deposition reaction occurring on its heated top surface, and the flow exits through the annular region between the cylinders. Very uniform growth has been observed in this reactor over a large central section of the disk, where the effects of the finite radius of the system are small. Figure 4 shows a coarse-grid steady-state solution for the rotating disk reactor, and Figure 5 shows how these solutions match experimental results.

Figure 4. Streamlines through the rotating disk reactor.

Figure 5. Predicted deposition rates match experiments over a range of parameters.

For these reactors, experimentalists have also observed transient behavior, which is much more computationally demanding to simulate. Visualizations of transient simulations are shown in Figure 6 and Figure 7. Table 3 shows the scaling performance of MPSalsa for such a simulation; a mechanism with 19 species yields a block size of 24. A half-million-node mesh was run on 900 compute nodes of the Sandia-Intel TFlop computer, while a two-million-node mesh was run on the full 3600 compute nodes, attaining a 97.7% relative scaled efficiency. Clearly, both the matrix fill and the iterative solver are highly scalable. In addition, use of the second processor on each compute node reduced the matrix fill time by 22% and the solve time by 12%.

Figure 6. Calculation of three-dimensional flow instability that was observed experimentally.

Figure 7. Temperature histories at symmetric positions show m=2 spatio-temporal instability.

Table 4 shows the performance of MPSalsa on transient runs using the TFQMR solver, which balances robustness with memory efficiency. As discussed earlier, the solver speed increases with block size, reaching a peak rate of 226 Gflop/s. The 8x8 block size run used a simple Gallium Arsenide mechanism; it attained a very fast fill rate, but the solver rate suffered from the small block size. The Gallium Arsenide deposition problem with the full 19-species mechanism, run on a two-million-node mesh, attained a sustained rate of 211 Gflop/s over the entire code: matrix fill, linear solver, and output of the solution. The use of the dual processors gained approximately 15% overall. The 36x36 block size corresponds to a mechanism with 31 species and several hundred reactions for the combustion of methane.

Mesh Size | # Unknowns | # TFlop Nodes | CPUs per Node | Fill Time | Fill Gflop/s | Solver Time | Solver Gflop/s | Overall Time | Overall Gflop/s | Scaled Efficiency
2.0x10^6  | 4.8x10^7   | 3600          | 2             | 36.6      | 110.8        | 236.6       | 205.5          | 273.2        | 192.4           | 97.7%
0.5x10^6  | 1.2x10^7   |  900          | 2             | 36.1      |  27.3        | 229.2       |  52.6          | 265.3        |  49.2           | --
0.5x10^6  | 1.2x10^7   |  900          | 1             | 44.1      |  22.3        | 251.0       |  48.0          | 295.1        |  44.2           | --
Table 3. Scaled Performance of MPSalsa (per Newton iteration) on the transient Rotating Disk CVD GaAs Reactor. The GMRES(100) iterative solver was used with block Jacobi matrix scaling. Times are in seconds.
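
As a consistency check on the 97.7% figure (assuming relative scaled efficiency is taken as the overall Gflop/s rate of the 3600-node run divided by the 900-node dual-CPU rate scaled by the factor-of-four increase in node count):

    192.4 / (4 x 49.2) = 192.4 / 196.8 \approx 0.978,

which agrees with the tabulated 97.7% to within rounding of the reported Gflop/s rates.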

Mesh Size | Block Size | # Unknowns | I/O Time | CPUs per Node | Fill Time | Fill Gflop/s | Solver Time | Solver Gflop/s | Overall Time | Overall Gflop/s
2.0x10^6  | 8x8        | 1.6x10^7   | 0.9      | 2             |  3.4      | 122.6        |  66.2       | 130.2          |  70.5        | 128.1
2.0x10^6  | 8x8        | 1.6x10^7   | 0.9      | 1             |  4.2      |  99.0        |  67.5       | 127.0          |  72.6        | 123.8
2.0x10^6  | 24x24      | 4.8x10^7   | 2.6      | 2             | 36.6      | 110.8        | 343.9       | 224.0          | 383.1        | 211.7
2.0x10^6  | 24x24      | 4.8x10^7   | 2.6      | 1             | 45.0      |  87.5        | 392.0       | 196.3          | 439.6        | 183.9
0.75x10^6 | 36x36      | 2.4x10^7   | 1.5      | 2             | 91.1      |  88.7~       | 280.8       | 226.1          | 373.4        | 191.6~
0.75x10^6 | 36x36      | 2.4x10^7   | 1.5      | 1             | 98.5      |  82.0~       | 314.7       | 201.8          | 414.4        | 172.8~
Table 4. Performance of MPSalsa on 3600 nodes of the Sandia-Intel TFlop computer (per Newton iteration), including single- and dual-processor runs. The TFQMR iterative solver was used with block Jacobi matrix scaling. Times are in seconds; I/O time is amortized over 5 time steps; 6-8 Newton steps per time step.
~ There were inconsistencies in the performance monitoring for the matrix fill of this run, but we believe these numbers are accurate.

5. Conclusions

We have developed a parallel code to simulate chemically reacting flows in complex geometries, with the specific goal of modeling systems such as chemical vapor deposition reactors. The code solves complex 3D flow, heat transfer, mass transfer and non-equilibrium chemical reaction equations on unstructured grids. Through a combination of innovative algorithms and data structures, this production scientific code has sustained over 210 Gflop/s on 3600 nodes of the Sandia-Intel TFlop computer while simulating a problem of great interest to Sandia and the semiconductor industry.

Acknowledgments

The authors gratefully acknowledge Greg Henry at Intel for providing key assembly-coded routines.

References

1. S. Carney, M. Heroux and G. Li. A proposal for a sparse BLAS toolkit, SPARKER Working Note #2, Cray Research, Inc., Eagan, MN, (1993).

2. M.S. Eldred, W.E. Hart, W.J. Bohnhoff, V.J. Romero, S.A. Hutchinson, and A.G. Salinger. Proceedings of the 6th AIAA/NASA/ISSMO Symposium on Multidisciplinary Analysis and Optimization, AIAA-96-4164-CP, Bellevue, WA, (1996) 1568-1582.

3. B. Hendrickson and R. Leland. A user's guide to Chaco, Version 1.0, Sandia National Laboratories Technical Report, SAND93-2339, Albuquerque, NM, (1993).

4. S. A. Hutchinson, J. N. Shadid, and R. S. Tuminaro. Aztec User's Guide: Version 1.0, Sandia National Laboratories Technical Report, SAND95-1559, Albuquerque, NM, (1995).

5. R. Kee, G. Dixon-Lewis, J. Warnatz, M. Coltrin, and J. Miller. A Fortran computer code package for the evaluation of gas-phase multicomponent transport properties, Sandia National Laboratories Technical Report, SAND86-8246, Albuquerque, NM, (1986).

6. R.J. Kee, F.M. Rupley, and J.A. Miller. Chemkin-II: A Fortran chemical kinetics package for the analysis of gas-phase chemical kinetics, Sandia National Laboratories Technical Report, SAND89-8009, Albuquerque, NM, (1989).

7. A.G. Salinger, K.D. Devine, G.L. Hennigan, H.K. Moffat, S.A. Hutchinson, and J.N. Shadid. MPSalsa: A Finite Element Computer Program for Reacting Flow Problems: Part 2 -- User's Guide and Examples, Sandia National Laboratories Technical Report, SAND96-2331, Albuquerque, NM, (1996).

8. J. Shadid, S. Hutchinson, and H. Moffat. Parallel Performance of a Preconditioned CG Solver for Unstructured Finite Element Applications, Proceedings of the Colorado Conference on Iterative Methods, Breckenridge CO, (1994).

9. J.N. Shadid, H.K. Moffat, S.A. Hutchinson, G.L. Hennigan, K.D. Devine, and A.G. Salinger. MPSalsa: A Finite Element Computer Program for Reacting Flow Problems: Part 1 - Theory Document, Sandia National Laboratories Technical Report, SAND95-2752, Albuquerque, NM, (1996).

10. J. Shadid and R. Tuminaro. A comparison of preconditioned nonsymmetric Krylov methods on a large-scale MIMD machine, SIAM J. Sci. Comput., Vol. 15, No. 2, (1994), 440-459.

11. J.N. Shadid, R.S. Tuminaro, and H.F. Walker. An Inexact Newton Method for Fully-Coupled Solution of the Navier-Stokes Equations with Heat and Mass Transport. Sandia National Laboratories Technical Report, SAND97-0132, Albuquerque, NM, (1997).