|
ABSTRACT
Regular distributions for storing dense matrices on parallel systems are not always used in practice. In many scientific applicati RUMMA) [1] to handle irregularly distributed matrices. Our approach relies on a distribution independent algorithm that provides dynamic load balancing by exploiting data locality and achieves performance as good as the traditional approach which relies on temporary arrays with regular distribution, data redistribution, and matrix multiplication for regular matrices to handle the irregular case. The proposed algorithm is memory-efficient because temporary matrices are not needed. This feature is critical for systems like the IBM Blue Gene/L that offer very limited amount of memory per node. The experimental results demonstrate very good performance across the range of matrix distributions and problem sizes motivated by real applications.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
M. Krishnan, J. Nieplocha, "SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems", Proc. IPDPS'04, 2004.
|
| |
2
|
|
| |
3
|
G. C. Fox, S. W. Otto, A. J. G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication", Parallel Computing, vol. 4, pp. 17--31, 1987.
|
| |
4
|
Geoffrey C. Fox , Mark A. Johnson , Gregory A. Lyzenga , Steve W. Otto , John K. Salmon , David W. Walker, Solving problems on concurrent processors. Vol. 1: General techniques and regular problems, Prentice-Hall, Inc., Upper Saddle River, NJ, 1988
|
| |
5
|
G.H. Golub, C.H Van Loan. Matrix Computations. Johns Hopkins University Press, 1989.
|
| |
6
|
J. Berntsen, Communication efficient matrix multiplication on hypercubes, Parallel Computing, vol. 12, 1989.
|
| |
7
|
A. Gupta and V. Kumar, "Scalability of Parallel Algorithms for Matrix Multiplication", ICPP'93, 1993.
|
| |
8
|
C. Lin and L.Snyder, "A matrix product algorithm and its comparative performance on hypercubes", SHPCC '92.
|
| |
9
|
Q. Luo and J. B. Drake, A Scalable Parallel Strassen's Matrix Multiply Algorithm for Distributed Memory Computers, http://citeseer.nj.nec.com/517382.html
|
| |
10
|
S. Huss-Lederman, E. M. Jacobson, and A. Tsao, "Comparison of Scalable Parallel Matrix Multiplication Libraries," in Proceedings of the Scalable Parallel Libraries Conference, 1994.
|
| |
11
|
C. T. Ho, S. L. Johnsson, A. Edelman, Matrix multiplication on hypercubes using full bandwidth and constant storage, Proc. 6 Distributed Memory Computing Conference. 1991.
|
 |
12
|
|
| |
13
|
J. Li, A. Skjellum, and R. D. Falgout, "A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies," Concurrency, Practice and Experience, vol. 9(5), pp. 345--389, 1997.
|
| |
14
|
E. Dekel, D. Nassimi, S. Sahni, Parallel matrix and graph algorithms, SIAM J. Computing, vol. 10, 1981.
|
| |
15
|
|
| |
16
|
J. Choi, J. Dongarra, and D. W. Walker, "PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers," Concurrency: Practice and Experience, vol. 6(7), 1994.
|
| |
17
|
S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, "Matrix Multiplication on the Intel Touchstone DELTA", Concurrency: Practice and Experience, vol. 6 (7). Oct 1994.
|
| |
18
|
|
| |
19
|
R. van de Geijn, R. and J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm," Concurrency: Practice and Experience, vol. 9(4), pp. 255--274, April 1997.
|
| |
20
|
|
| |
21
|
Jack J. Dongarra , L. S. Blackford , J. Choi , A. Cleary , E. D'Azeuedo , J. Demmel , I. Dhillon , S. Hammarling , G. Henry , A. Petitet , K. Stanley , D. Walker , R. C. Whaley, ScaLAPACK user's guide, Society for Industrial and Applied Mathematics, Philadelphia, PA, 1997
|
| |
22
|
|
| |
23
|
C. Addison and Y. Ren, "OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK libraries", in Proceedings EWOMP'01. 2001.
|
| |
24
|
G.R. Luecke, W. Lin, "Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000", Concurrency and Computation: Practice and Experience, vol. 13, pp 905--928. 2001.
|
| |
25
|
|
| |
26
|
T. Betcke, "Performance analysis of various parallelization methods for BLAS3 routines on cluster architectures",John von Neumann-Instituts fur Computing, Tech. Rep. FZJ-ZAM-IB-2000-15, 2000.
|
| |
27
|
J. L. Träff, H. Ritzdorf, R. Hempel "The Implementation of MPI-2 One-Sided Communication for the NEC SX-5", in Proceedings of Supercomputing, 2000.
|
 |
28
|
Jiuxing Liu , Jiesheng Wu , Sushmitha P. Kini , Pete Wyckoff , Dhabaleswar K. Panda, High performance RDMA-based MPI implementation over InfiniBand, Proceedings of the 17th annual international conference on Supercomputing, June 23-26, 2003, San Francisco, CA, USA
[doi> 10.1145/782814.782855]
|
| |
29
|
J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D.K. Panda," Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters", IEEE CLUSTER'03, 2003.
|
| |
30
|
A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Addison Wesley, 2003.
|
| |
31
|
Cray Online documentation. Optimizing Applications on the Cray X1TM System. http://www.cray.com/craydoc/20/manuals/S-2315-50/html-S-2315-50/S-2315-50-toc.html
|
| |
32
|
|
| |
33
|
ARMCI Web page. http://www.emsl.pnl.gov/docs/parsoft/armci/
|
| |
34
|
|
| |
35
|
|
| |
36
|
|
| |
37
|
T.H. Dunning, Jr. J. Chem. Phys. 90, 1007 (1989).
|
| |
38
|
Y. Alexeev, M. Valiev, D. A. Dixon, T. L. Windus, "Ab initio study of catalytic GTP hydrolysis", J. of ACS, '04.
|
| |
39
|
I. Foster et al. "Toward High-Performance Computational Chemistry: I. Scalable Fock Matrix Construction Algorithms", Journal of computational chemistry, vol. 17, No. 1, 109--123, 1996.
|
| |
40
|
|
| |
41
|
|
| |
42
|
Hyuk-Jae Lee, J.A.B. Fortes, Toward data distribution independent parallel matrix multiplication, IPDPS, 1995.
|
 |
43
|
S. D. Kaushik , C.-H. Huang , R. W. Johnson , P. Sadayappan, An approach to communication-efficient data redistribution, Proceedings of the 8th international conference on Supercomputing, p.364-373, July 11-15, 1994, Manchester, England
[doi> 10.1145/181181.181563]
|
| |
44
|
Banicescu, Ioana and R. Lu, Experiences with Fractiling in N-Body Simulations, HPC Symposium, 1998.
|
| |
45
|
Kendall et al, "High Performance Computational Chemistry: an Overview of NWChem a Distributed Parallel Application", Computer Phys. Comm., 2000, 128, 260--283.
|
CITED BY
|
|
Muhammad Hafeez , Muhammad Younus , Abdur Rehman , Athar Mohsin, Optimal solution to matrix parenthesization problem employing parallel processing approach, Proceedings of the 8th Conference on 8th WSEAS International Conference on Evolutionary Computing, p.235-240, June 19-21, 2007, Vancouver, British Columbia, Canada
|
|