Bruce Hendrickson,
David Womble
Sandia National Laboratories
Albuquerque, NM 87185-1110
SIAM Journal of Scientific Computing
Volume 15, Number 5, 1994
Pages 1201-1226
Abstract: Dense linear systems of equations are quite common in science and engineering, arising in boundary element methods, least squares problems and other settings. Massively parallel computers will be necessary to solve the large systems required by scientists and engineers, and scalable parallel algorithms for the linear algebra applications must be devised for these machines. A critical step in these algorithms is the mapping of matrix elements to processors. In this paper, we study the use of the {\em torus--wrap mapping} in general dense matrix algorithms, from both theoretical and practical viewpoints. We prove that, under reasonable assumptions, this assignment scheme leads to dense matrix algorithms that achieve (to within a constant factor) the lower bound on interprocessor communication. We also show that the torus--wrap mapping allows algorithms to exhibit less idle time, better load balancing and less memory overhead than the more common row and column mappings. Finally, we discuss practical implementation issues, such as compatibility with BLAS levels 1, 2, and 3, and present the results of implementations of several dense matrix algorithms. These theoretical and experimental results are compared with those obtained from more traditional mappings.