ABSTRACT
Regular distributions for storing dense matrices on parallel systems are not always used in practice; in many scientific applications, matrices are distributed irregularly. This paper extends the SRUMMA matrix multiplication algorithm [1] to handle irregularly distributed matrices. Our approach relies on a distribution-independent algorithm that provides dynamic load balancing by exploiting data locality, and it achieves performance as good as the traditional approach, which handles the irregular case by allocating temporary arrays with a regular distribution, redistributing the data, and applying a matrix multiplication for regular matrices. The proposed algorithm is memory-efficient because no temporary matrices are needed. This feature is critical for systems such as the IBM Blue Gene/L, which offer a very limited amount of memory per node. The experimental results demonstrate very good performance across the range of matrix distributions and problem sizes motivated by real applications.
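The distribution-independent idea described above can be sketched in a few lines. The following is a hypothetical single-process illustration, not the paper's implementation: each block of C defined by uneven row/column cuts is one task, and C is computed directly from the irregular partition, with no temporary regularly distributed copies of A or B. All names (`block_matmul_irregular`, `row_cuts`, `col_cuts`) are invented for this sketch.

```python
import itertools

def block_matmul_irregular(A, B, row_cuts, col_cuts):
    """Compute C = A*B block by block over uneven (irregular) partitions.

    Sketch only: in a real distributed implementation each task (one C
    block) would be claimed dynamically via a shared counter (load
    balancing), the owning process would fetch the needed A and B panels
    with one-sided gets, and locally held blocks would be preferred.
    """
    C = [[0.0] * len(B[0]) for _ in range(len(A))]
    tasks = list(itertools.product(
        zip(row_cuts[:-1], row_cuts[1:]),    # uneven row panels
        zip(col_cuts[:-1], col_cuts[1:])))   # uneven column panels
    for (rlo, rhi), (clo, chi) in tasks:
        # The "owner" of this C block computes it in place; no temporary
        # regular arrays and no data redistribution step are needed.
        for i in range(rlo, rhi):
            for j in range(clo, chi):
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))
    return C
```

For example, `block_matmul_irregular(A, B, [0, 1, 4], [0, 3, 4])` multiplies 4x4 matrices using deliberately uneven 1-row/3-row and 3-column/1-column panels, mimicking an irregular distribution.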
REFERENCES
- M. Krishnan and J. Nieplocha, "SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems", Proc. IPDPS'04, 2004.
- L. E. Cannon, "A Cellular Computer to Implement the Kalman Filter Algorithm", Ph.D. dissertation, Montana State University, 1969.
- G. C. Fox, S. W. Otto, and A. J. G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication", Parallel Computing, vol. 4, pp. 17--31, 1987.
- G. C. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice Hall, 1988.
- G. H. Golub and C. H. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
- J. Berntsen, "Communication efficient matrix multiplication on hypercubes", Parallel Computing, vol. 12, 1989.
- A. Gupta and V. Kumar, "Scalability of Parallel Algorithms for Matrix Multiplication", Proc. ICPP'93, 1993.
- C. Lin and L. Snyder, "A matrix product algorithm and its comparative performance on hypercubes", Proc. SHPCC'92, 1992.
- Q. Luo and J. B. Drake, "A Scalable Parallel Strassen's Matrix Multiply Algorithm for Distributed Memory Computers", http://citeseer.nj.nec.com/517382.html
- S. Huss-Lederman, E. M. Jacobson, and A. Tsao, "Comparison of Scalable Parallel Matrix Multiplication Libraries", in Proceedings of the Scalable Parallel Libraries Conference, 1994.
- C. T. Ho, S. L. Johnsson, and A. Edelman, "Matrix multiplication on hypercubes using full bandwidth and constant storage", Proc. 6th Distributed Memory Computing Conference, 1991.
- H. Gupta and P. Sadayappan, "Communication Efficient Matrix Multiplication on Hypercubes", in Proceedings of the Sixth ACM Symposium on Parallel Algorithms and Architectures, 1994.
- J. Li, A. Skjellum, and R. D. Falgout, "A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies", Concurrency: Practice and Experience, vol. 9(5), pp. 345--389, 1997.
- E. Dekel, D. Nassimi, and S. Sahni, "Parallel matrix and graph algorithms", SIAM J. Computing, vol. 10, 1981.
- S. Ranka and S. Sahni, Hypercube Algorithms for Image Processing and Pattern Recognition, Springer-Verlag, 1990.
- J. Choi, J. Dongarra, and D. W. Walker, "PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers", Concurrency: Practice and Experience, vol. 6(7), 1994.
- S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, "Matrix Multiplication on the Intel Touchstone DELTA", Concurrency: Practice and Experience, vol. 6(7), Oct. 1994.
- R. C. Agarwal, F. Gustavson, and M. Zubair, "A high performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication", IBM J. Research and Development, vol. 38(6), 1994.
- R. van de Geijn and J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm", Concurrency: Practice and Experience, vol. 9(4), pp. 255--274, April 1997.
- J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, "A Proposal for a Set of Parallel Basic Linear Algebra Subprograms", University of Tennessee, Tech. Rep. CS-95-292, May 1995.
- L. S. Blackford et al., ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, 1997.
- J. Choi, "A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers", in Proceedings of the 11th International Parallel Processing Symposium (IPPS'97), 1997.
- C. Addison and Y. Ren, "OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries", in Proceedings of EWOMP'01, 2001.
- G. R. Luecke and W. Lin, "Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000", Concurrency and Computation: Practice and Experience, vol. 13, pp. 905--928, 2001.
- M. Wu, S. Aluru, and R. A. Kendall, "Mixed Mode Matrix Multiplication", Proc. IEEE International Conference on Cluster Computing, 2002.
- T. Betcke, "Performance analysis of various parallelization methods for BLAS3 routines on cluster architectures", John von Neumann Institute for Computing, Tech. Rep. FZJ-ZAM-IB-2000-15, 2000.
- J. L. Träff, H. Ritzdorf, and R. Hempel, "The Implementation of MPI-2 One-Sided Communication for the NEC SX-5", in Proceedings of Supercomputing, 2000.
- J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda, "High Performance RDMA-Based MPI Implementation over InfiniBand", Proc. ACM International Conference on Supercomputing, 2003.
- J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D. K. Panda, "Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters", Proc. IEEE CLUSTER'03, 2003.
- A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Addison Wesley, 2003.
- Cray online documentation, Optimizing Applications on the Cray X1 System, http://www.cray.com/craydoc/20/manuals/S-2315-50/html-S-2315-50/S-2315-50-toc.html
- J. Nieplocha and B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems", Proc. RTSPP, IPPS/SPDP'99, 1999.
- ARMCI web page, http://www.emsl.pnl.gov/docs/parsoft/armci/
- J. Nieplocha, V. Tipparaju, J. Ju, and E. Apra, "One-sided communication on Myrinet", Cluster Computing, 2003.
- J. Nieplocha, V. Tipparaju, A. Saify, and D. Panda, "Protocols and Strategies for Optimizing Remote Memory Operations on Clusters", Proc. CAC/IPDPS'02, 2002.
- M. Krishnan and J. Nieplocha, "Optimizing Parallel Multiplication Operation for Rectangular and Transposed Matrices", in Proceedings of the 10th IEEE ICPADS, 2004.
- T. H. Dunning, Jr., J. Chem. Phys., vol. 90, p. 1007, 1989.
- Y. Alexeev, M. Valiev, D. A. Dixon, and T. L. Windus, "Ab initio study of catalytic GTP hydrolysis", Journal of the American Chemical Society, 2004.
- I. Foster et al., "Toward High-Performance Computational Chemistry: I. Scalable Fock Matrix Construction Algorithms", Journal of Computational Chemistry, vol. 17(1), pp. 109--123, 1996.
- C. Hsu, Y. Chung, and C. Dow, "Efficient Methods for Multi-Dimensional Array Redistribution", Journal of Supercomputing, vol. 17, pp. 23--46, 2000.
- C. Edwards, P. Geng, A. Patra, and R. van de Geijn, "Parallel Matrix Distributions: Have We Been Doing It All Wrong?", Tech. Rep. TR-95-39, Department of Computer Sciences, University of Texas, Oct. 1995.
- H.-J. Lee and J. A. B. Fortes, "Toward data distribution independent parallel matrix multiplication", Proc. IPPS'95, 1995.
- S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan, "An Approach to Communication-Efficient Data Redistribution", Proc. Supercomputing'94, pp. 364--373, 1994.
- I. Banicescu and R. Lu, "Experiences with Fractiling in N-Body Simulations", Proc. HPC Symposium, 1998.
- R. Kendall et al., "High Performance Computational Chemistry: An Overview of NWChem, a Distributed Parallel Application", Computer Physics Communications, vol. 128, pp. 260--283, 2000.
Index Terms
- Memory efficient parallel matrix multiplication operation for irregular problems