DOI: 10.1145/1128022.1128054

Memory efficient parallel matrix multiplication operation for irregular problems

Published: 03 May 2006

ABSTRACT

Regular distributions for storing dense matrices on parallel systems are not always used in practice. In many scientific applications, matrices are distributed irregularly. This paper extends a parallel matrix multiplication algorithm (SRUMMA) [1] to handle irregularly distributed matrices. Our approach relies on a distribution-independent algorithm that provides dynamic load balancing by exploiting data locality. It achieves performance as good as the traditional approach, which handles the irregular case through temporary arrays with a regular distribution, data redistribution, and a matrix multiplication algorithm for regular matrices. The proposed algorithm is memory-efficient because no temporary matrices are needed. This feature is critical for systems such as the IBM Blue Gene/L that offer a very limited amount of memory per node. The experimental results demonstrate very good performance across the range of matrix distributions and problem sizes motivated by real applications.
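To make the approach concrete, here is a minimal sequential sketch of distribution-independent blocked multiplication over irregular block boundaries, written in Python/NumPy. Everything in it is an illustrative assumption rather than the paper's code: the helper names (irregular_blocks, irregular_matmul), the cut-point representation of the irregular distribution, and the explicit task list are invented for this sketch. Local array slicing stands in for the one-sided get/accumulate communication (e.g., via ARMCI [32, 33]) that the parallel algorithm would use, and the dynamic load balancing would come from processes drawing (i, j) tasks from a shared counter, preferring locally owned blocks, rather than from a sequential loop.

```python
# Sequential sketch of distribution-independent blocked matrix multiply
# over irregular block boundaries. Helper names and the cut-point
# representation are assumptions for illustration, not the paper's API.
import numpy as np

def irregular_blocks(n, cuts):
    """Turn irregular cut points into a list of (start, stop) block ranges."""
    edges = [0] + sorted(cuts) + [n]
    return [(edges[i], edges[i + 1]) for i in range(len(edges) - 1)]

def irregular_matmul(A, B, row_cuts, col_cuts, inner_cuts):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    row_blk = irregular_blocks(n, row_cuts)
    col_blk = irregular_blocks(m, col_cuts)
    inn_blk = irregular_blocks(k, inner_cuts)

    # Task pool: every (i, j) block of C is an independent task. In the
    # parallel algorithm, a shared counter would hand these tasks to
    # processes dynamically, favoring tasks whose inputs are local.
    tasks = [(rb, cb) for rb in row_blk for cb in col_blk]
    for (r0, r1), (c0, c1) in tasks:
        for (k0, k1) in inn_blk:
            # Stand-ins for one-sided gets of remote blocks; note that no
            # temporary full-size, regularly distributed copies are made.
            a = A[r0:r1, k0:k1]
            b = B[k0:k1, c0:c1]
            C[r0:r1, c0:c1] += a @ b  # local dgemm, then accumulate
    return C

# Usage: multiply with deliberately irregular block boundaries and
# check the result against the dense product.
A = np.random.rand(7, 5)
B = np.random.rand(5, 6)
C = irregular_matmul(A, B, row_cuts=[2, 3], col_cuts=[4], inner_cuts=[1, 4])
assert np.allclose(C, A @ B)
```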

References

  1. M. Krishnan and J. Nieplocha, "SRUMMA: A Matrix Multiplication Algorithm Suitable for Clusters and Scalable Shared Memory Systems," Proc. IPDPS'04, 2004.
  2. L. E. Cannon, "A Cellular Computer to Implement the Kalman Filter Algorithm," Ph.D. dissertation, Montana State University, 1969.
  3. G. C. Fox, S. W. Otto, and A. J. G. Hey, "Matrix algorithms on a hypercube I: Matrix multiplication," Parallel Computing, vol. 4, pp. 17--31, 1987.
  4. G. C. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice Hall, 1988.
  5. G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, 1989.
  6. J. Berntsen, "Communication efficient matrix multiplication on hypercubes," Parallel Computing, vol. 12, 1989.
  7. A. Gupta and V. Kumar, "Scalability of Parallel Algorithms for Matrix Multiplication," Proc. ICPP'93, 1993.
  8. C. Lin and L. Snyder, "A matrix product algorithm and its comparative performance on hypercubes," Proc. SHPCC'92, 1992.
  9. Q. Luo and J. B. Drake, "A Scalable Parallel Strassen's Matrix Multiply Algorithm for Distributed Memory Computers," http://citeseer.nj.nec.com/517382.html
  10. S. Huss-Lederman, E. M. Jacobson, and A. Tsao, "Comparison of Scalable Parallel Matrix Multiplication Libraries," Proc. Scalable Parallel Libraries Conference, 1994.
  11. C. T. Ho, S. L. Johnsson, and A. Edelman, "Matrix multiplication on hypercubes using full bandwidth and constant storage," Proc. 6th Distributed Memory Computing Conference, 1991.
  12. H. Gupta and P. Sadayappan, "Communication Efficient Matrix Multiplication on Hypercubes," Proc. Sixth ACM Symposium on Parallel Algorithms and Architectures, 1994.
  13. J. Li, A. Skjellum, and R. D. Falgout, "A Poly-Algorithm for Parallel Dense Matrix Multiplication on Two-Dimensional Process Grid Topologies," Concurrency: Practice and Experience, vol. 9(5), pp. 345--389, 1997.
  14. E. Dekel, D. Nassimi, and S. Sahni, "Parallel matrix and graph algorithms," SIAM J. Computing, vol. 10, 1981.
  15. S. Ranka and S. Sahni, Hypercube Algorithms for Image Processing and Pattern Recognition, Springer-Verlag, 1990.
  16. J. Choi, J. Dongarra, and D. W. Walker, "PUMMA: Parallel Universal Matrix Multiplication Algorithms on distributed memory concurrent computers," Concurrency: Practice and Experience, vol. 6(7), 1994.
  17. S. Huss-Lederman, E. Jacobson, A. Tsao, and G. Zhang, "Matrix Multiplication on the Intel Touchstone DELTA," Concurrency: Practice and Experience, vol. 6(7), Oct. 1994.
  18. R. C. Agarwal, F. Gustavson, and M. Zubair, "A high performance matrix multiplication algorithm on a distributed memory parallel computer using overlapped communication," IBM J. Research and Development, vol. 38(6), 1994.
  19. R. van de Geijn and J. Watts, "SUMMA: Scalable Universal Matrix Multiplication Algorithm," Concurrency: Practice and Experience, vol. 9(4), pp. 255--274, Apr. 1997.
  20. J. Choi, J. Dongarra, S. Ostrouchov, A. Petitet, D. Walker, and R. C. Whaley, "A Proposal for a Set of Parallel Basic Linear Algebra Subprograms," University of Tennessee, Tech. Rep. CS-95-292, May 1995.
  21. L. S. Blackford et al., ScaLAPACK Users' Guide, SIAM, Philadelphia, PA, 1997.
  22. J. Choi, "A Fast Scalable Universal Matrix Multiplication Algorithm on Distributed-Memory Concurrent Computers," Proc. 11th International Parallel Processing Symposium (IPPS'97), 1997.
  23. C. Addison and Y. Ren, "OpenMP Issues Arising in the Development of Parallel BLAS and LAPACK Libraries," Proc. EWOMP'01, 2001.
  24. G. R. Luecke and W. Lin, "Scalability and Performance of OpenMP and MPI on a 128-Processor SGI Origin 2000," Concurrency and Computation: Practice and Experience, vol. 13, pp. 905--928, 2001.
  25. M. Wu, S. Aluru, and R. A. Kendall, "Mixed Mode Matrix Multiplication," Proc. IEEE Intl. Conference on Cluster Computing, 2002.
  26. T. Betcke, "Performance analysis of various parallelization methods for BLAS3 routines on cluster architectures," John von Neumann Institute for Computing, Tech. Rep. FZJ-ZAM-IB-2000-15, 2000.
  27. J. L. Träff, H. Ritzdorf, and R. Hempel, "The Implementation of MPI-2 One-Sided Communication for the NEC SX-5," Proc. Supercomputing, 2000.
  28. J. Liu, J. Wu, S. P. Kini, P. Wyckoff, and D. K. Panda, "High Performance RDMA-Based MPI Implementation over InfiniBand," Proc. ACM Intl. Conference on Supercomputing, 2003.
  29. J. Nieplocha, V. Tipparaju, M. Krishnan, G. Santhanaraman, and D. K. Panda, "Optimizing Mechanisms for Latency Tolerance in Remote Memory Access Communication on Clusters," Proc. IEEE CLUSTER'03, 2003.
  30. A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, Addison-Wesley, 2003.
  31. Cray online documentation, Optimizing Applications on the Cray X1 System. http://www.cray.com/craydoc/20/manuals/S-2315-50/html-S-2315-50/S-2315-50-toc.html
  32. J. Nieplocha and B. Carpenter, "ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems," Proc. RTSPP of IPPS/SPDP'99, 1999.
  33. ARMCI web page. http://www.emsl.pnl.gov/docs/parsoft/armci/
  34. J. Nieplocha, V. Tipparaju, J. Ju, and E. Apra, "One-sided communication on Myrinet," Cluster Computing, 2003.
  35. J. Nieplocha, V. Tipparaju, A. Saify, and D. Panda, "Protocols and Strategies for Optimizing Remote Memory Operations on Clusters," Proc. CAC/IPDPS'02, 2002.
  36. M. Krishnan and J. Nieplocha, "Optimizing Parallel Multiplication Operation for Rectangular and Transposed Matrices," Proc. 10th IEEE ICPADS, 2004.
  37. T. H. Dunning, Jr., J. Chem. Phys., vol. 90, p. 1007, 1989.
  38. Y. Alexeev, M. Valiev, D. A. Dixon, and T. L. Windus, "Ab initio study of catalytic GTP hydrolysis," J. Am. Chem. Soc., 2004.
  39. I. Foster et al., "Toward High-Performance Computational Chemistry: I. Scalable Fock Matrix Construction Algorithms," Journal of Computational Chemistry, vol. 17(1), pp. 109--123, 1996.
  40. C. Hsu, Y. Chung, and C. Dow, "Efficient Methods for Multi-Dimensional Array Redistribution," Journal of Supercomputing, vol. 17, pp. 23--46, 2000.
  41. C. Edwards, P. Geng, A. Patra, and R. van de Geijn, "Parallel Matrix Distributions: Have we been doing it all wrong?," Tech. Rep. TR-95-39, Department of Computer Sciences, University of Texas, Oct. 1995.
  42. H.-J. Lee and J. A. B. Fortes, "Toward data distribution independent parallel matrix multiplication," Proc. IPPS'95, 1995.
  43. S. D. Kaushik, C.-H. Huang, R. W. Johnson, and P. Sadayappan, "An Approach to Communication-Efficient Data Redistribution," Proc. Supercomputing'94, pp. 364--373, 1994.
  44. I. Banicescu and R. Lu, "Experiences with Fractiling in N-Body Simulations," Proc. HPC Symposium, 1998.
  45. R. A. Kendall et al., "High Performance Computational Chemistry: An Overview of NWChem, a Distributed Parallel Application," Computer Physics Communications, vol. 128, pp. 260--283, 2000.
