ABSTRACT
Regular distributions for storing dense matrices on parallel systems are not always used in practice; in many scientific applications, matrices are distributed irregularly. This paper extends the SRUMMA matrix multiplication algorithm [1] to handle irregularly distributed matrices. Our approach relies on a distribution-independent algorithm that provides dynamic load balancing by exploiting data locality, and it achieves performance as good as the traditional approach, which handles the irregular case by allocating temporary arrays with a regular distribution, redistributing the data, and applying a matrix multiplication for regular matrices. The proposed algorithm is memory-efficient because no temporary matrices are needed. This feature is critical for systems such as the IBM Blue Gene/L, which offer a very limited amount of memory per node. The experimental results demonstrate very good performance across the range of matrix distributions and problem sizes motivated by real applications.
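The distribution-independent idea described above can be sketched in a few lines. The following is a hypothetical single-process illustration, not the paper's implementation: each block of C defined by uneven row/column cuts is one task, and C is computed directly from the irregular partition, with no temporary regularly distributed copies of A or B. All names (`block_matmul_irregular`, `row_cuts`, `col_cuts`) are invented for this sketch.

```python
import itertools

def block_matmul_irregular(A, B, row_cuts, col_cuts):
    """Compute C = A*B block by block over uneven (irregular) partitions.

    Sketch only: in a real distributed implementation each task (one C
    block) would be claimed dynamically via a shared counter (load
    balancing), the owning process would fetch the needed A and B panels
    with one-sided gets, and locally held blocks would be preferred.
    """
    C = [[0.0] * len(B[0]) for _ in range(len(A))]
    tasks = list(itertools.product(
        zip(row_cuts[:-1], row_cuts[1:]),    # uneven row panels
        zip(col_cuts[:-1], col_cuts[1:])))   # uneven column panels
    for (rlo, rhi), (clo, chi) in tasks:
        # The "owner" of this C block computes it in place; no temporary
        # regular arrays and no data redistribution step are needed.
        for i in range(rlo, rhi):
            for j in range(clo, chi):
                C[i][j] = sum(A[i][k] * B[k][j] for k in range(len(B)))
    return C
```

For example, `block_matmul_irregular(A, B, [0, 1, 4], [0, 3, 4])` multiplies 4x4 matrices using deliberately uneven 1-row/3-row and 3-column/1-column panels, mimicking an irregular distribution.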
REFERENCES
- M. Krishnan and J. Nieplocha, "SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems", Proc. IPDPS'04, 2004.
- L. E. Cannon, "A Cellular Computer to Implement the Kalman Filter Algorithm", Ph.D. dissertation, Montana State University, 1969.
- G. C. Fox, S. W. Otto, and A. J. G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication", Parallel Computing, vol. 4, pp. 17--31, 1987.
- G. C. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice Hall, 1988.
- G. H. Golub and C. H. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
- J. Berntsen, "Communication efficient matrix multiplication on hypercubes", Parallel Computing, vol. 12, 1989.
- A. Gupta and V. Kumar, "Scalability of Parallel Algorithms for Matrix Multiplication", Proc. ICPP'93, 1993.
- C. Lin and L. Snyder, "A matrix product algorithm and its comparative performance on hypercubes", Proc. SHPCC'92, 1992.
- Q. Luo and J. B. Drake, "A Scalable Parallel Strassen's Matrix Multiply Algorithm for Distributed Memory Computers", http://citeseer.nj.nec.com/517382.html
- S. Huss-Lederman, E. M. Jacobson, and A. Tsao, "Comparison of Scalable Parallel Matrix Multiplication Libraries", in Proceedings of the Scalable Parallel Libraries Conference, 1994.
- C. T. Ho, S. L. Johnsson, and A. Edelman, "Matrix multiplication on hypercubes using full bandwidth and constant storage", Proc. 6th Distributed Memory Computing Conference, 1991.
- H. Gupta and P. Sadayappan, "Communication Efficient Matrix Multiplication on Hypercubes", in Proceedings of the Sixth ACM Symposium on Parallel Algorithms and Architectures, 1994.
- J. Li, A. Skjellum, and R. D. Falgout, "A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies", Concurrency: Practice and Experience, vol. 9(5), pp. 345--389, 1997.
- E. Dekel, D. Nassimi, and S. Sahni, "Parallel matrix and graph algorithms", SIAM J. Computing, vol. 10, 1981.
- S. Ranka and S. Sahni, Hypercube Algorithms for Image Processing and Pattern Recognition, Springer-Verlag, 1990.
- J. Choi, J. Dongarra, and D. W. Walker, "PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers", Concurrency: Practice and Experience, vol. 6(7), 1994.
- S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, "Matrix Multiplication on the Intel Touchstone DELTA", Concurrency: Practice and Experience, vol. 6(7), Oct. 1994.
- R. C. Agarwal, F. Gustavson, and M. Zubair, "A high performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication", IBM J. Research and Development, vol. 38(6), 1994.
- R. van de Geijn and J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm", Concurrency: Practice and Experience, vol. 9(4), pp. 255--274, April 1997.
- J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, "A Proposal for a Set of Parallel Basic Linear Algebra Subprograms", University of Tennessee, Tech. Rep. CS-95-292, May 1995.
- L. S. Blackford et al., ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, 1997.
- J. Choi, "A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers", in Proceedings of the 11th International Parallel Processing Symposium (IPPS'97), 1997.
- C. Addison and Y. Ren, "OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries", in Proceedings of EWOMP'01, 2001.
- G. R. Luecke and W. Lin, "Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000", Concurrency and Computation: Practice and Experience, vol. 13, pp. 905--928, 2001.
- M. Wu, S. Aluru, and R. A. Kendall, "Mixed Mode Matrix Multiplication", Proc. IEEE International Conference on Cluster Computing, 2002.
- T. Betcke, "Performance analysis of various parallelization methods for BLAS3 routines on cluster architectures", John von Neumann Institute for Computing, Tech. Rep. FZJ-ZAM-IB-2000-15, 2000.
- J. L. Träff, H. Ritzdorf, and R. Hempel, "The Implementation of MPI-2 One-Sided Communication for the NEC SX-5", in Proceedings of Supercomputing, 2000.
- J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda, "High Performance RDMA-Based MPI Implementation over InfiniBand", Proc. ACM International Conference on Supercomputing, 2003.
- J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D. K. Panda, "Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters", Proc. IEEE CLUSTER'03, 2003.
- A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Addison Wesley, 2003.
- Cray online documentation, Optimizing Applications on the Cray X1 System, http://www.cray.com/craydoc/20/manuals/S-2315-50/html-S-2315-50/S-2315-50-toc.html
- J. Nieplocha and B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems", Proc. RTSPP, IPPS/SPDP'99, 1999.
- ARMCI web page, http://www.emsl.pnl.gov/docs/parsoft/armci/
- J. Nieplocha, V. Tipparaju, J. Ju, and E. Apra, "One-sided communication on Myrinet", Cluster Computing, 2003.
- J. Nieplocha, V. Tipparaju, A. Saify, and D. Panda, "Protocols and Strategies for Optimizing Remote Memory Operations on Clusters", Proc. CAC/IPDPS'02, 2002.
- M. Krishnan and J. Nieplocha, "Optimizing Parallel Multiplication Operation for Rectangular and Transposed Matrices", in Proceedings of the 10th IEEE ICPADS, 2004.
- T. H. Dunning, Jr., J. Chem. Phys., vol. 90, p. 1007, 1989.
- Y. Alexeev, M. Valiev, D. A. Dixon, and T. L. Windus, "Ab initio study of catalytic GTP hydrolysis", Journal of the American Chemical Society, 2004.
- I. Foster et al., "Toward High-Performance Computational Chemistry: I. Scalable Fock Matrix Construction Algorithms", Journal of Computational Chemistry, vol. 17(1), pp. 109--123, 1996.
- C. Hsu, Y. Chung, and C. Dow, "Efficient Methods for Multi-Dimensional Array Redistribution", Journal of Supercomputing, vol. 17, pp. 23--46, 2000.
- C. Edwards, P. Geng, A. Patra, and R. van de Geijn, "Parallel Matrix Distributions: Have We Been Doing It All Wrong?", Tech. Rep. TR-95-39, Department of Computer Sciences, University of Texas, Oct. 1995.
- H.-J. Lee and J. A. B. Fortes, "Toward data distribution independent parallel matrix multiplication", Proc. IPPS'95, 1995.
- S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan, "An Approach to Communication-Efficient Data Redistribution", Proc. Supercomputing'94, pp. 364--373, 1994.
- I. Banicescu and R. Lu, "Experiences with Fractiling in N-Body Simulations", Proc. HPC Symposium, 1998.
- R. Kendall et al., "High Performance Computational Chemistry: An Overview of NWChem, a Distributed Parallel Application", Computer Physics Communications, vol. 128, pp. 260--283, 2000.
Index Terms
- Memory efficient parallel matrix multiplication operation for irregular problems