skip to main content
10.1145/2503210.2503249acmconferencesArticle/Chapter ViewAbstractPublication PagesscConference Proceedingsconference-collections
research-article

Parallel reduction to hessenberg form with algorithm-based fault tolerance

Published:17 November 2013Publication History

ABSTRACT

This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.

References

  1. E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, Third edition, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, Philadelphia, Second edition, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. C. H. Bischof and C. V. Loan. The WY Representation for Products of Householder Matrices. In Parallel Processing for Scientific Computing, pages 2--13, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. L. S. Blackford, J. Choi, A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. In Proceedings of the 18th international conference on Parallel Processing, Euro-Par'12, pages 477--488, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI. Technical Report UT-CS-12-702, University of Tennessee, 2012.Google ScholarGoogle Scholar
  7. D. Boley, G. H. Golub, S. Makar, N. Saxena, and E. J. McCluskey. Floating point fault tolerance with backward error assertions. IEEE Transactions on Computers, 44(2):302--311, February 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. Bosilca, A. Bouteiller, É. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Hérault, Y. Robert, F. Vivien, and D. Zaidouni. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Technical Report RR-7950, INRIA, October 2012.Google ScholarGoogle Scholar
  9. G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. Algorithm-based fault tolerance applied to high performance computing. J. Parallel Distrib. Comput., 69(4):410--416, April 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Using Group Replication for Resilience on Exascale Systems. Technical Report RR-7876, INRIA, February 2012.Google ScholarGoogle Scholar
  11. K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm. ii. aggressive early deflation. SIAM J. Matrix Anal. Appl., 23:948--973, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Brin and L. Page. The antaomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 33:107--17, 1998. Also available online at http://infolab.stanford.edu/pub/papers/google.pdf. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. K. Bryan and T. Leise. The $25,000,000,000 eigenvector. the linear algebra behind google. SIAM Review, 48(3):569--81, 2006. Also avaiable at http://www.rose-hulman.edu/~bryan/google.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Combining Process Replication and Checkpointing for Resilience on Exascale Systems. Technical Report RR-7951, INRIA, May 2012.Google ScholarGoogle Scholar
  15. A. M. Castaldo, R. C. Whaley, and A. T. Chronopoulos. Reducing floating point error in dot product using the superblock family of algorithms. SIAM J. Sci. Comput., 31(2):1156--1174, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Z. Chen. Scalable techniques for fault tolerant high performance computing. PhD thesis, University of Tennessee, Knoxville, TN, USA, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen. High Performance Linpack Benchmark: A Fault Tolerant Implementation Without Checkpointing. In Proceedings of the International Conference on Supercomputing, ICS '11, pages 162--171, New York, NY, USA, 2011. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Dongarra, P. Beckman, and T. Moore. The international Exascale software project roadmap. Int. J. High Perform. Comput. Appl., 25(1):3--60, February 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience, 15:1--18, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  20. J. J. Dongarra and R. A. van de Geijn. Reduction to condensed form for the eigenvalue problem on distributed memory architectures. Parallel Computing, 18(9):973--982, 1992.Google ScholarGoogle ScholarCross RefCross Ref
  21. P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. SIGPLAN Not., 47(8):225--234, February 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. G. F. Francis. The QR transformation a unitary analogue to the LR transformation. I. Comput. J., 4:265--271, 1961.Google ScholarGoogle ScholarCross RefCross Ref
  23. J. G. F. Francis. The QR transformation II. Comput. J., 4:332--345, 1962.Google ScholarGoogle ScholarCross RefCross Ref
  24. G. H. Golub and C. F. V. Loan. Matrix Computations. The John Hopkins University Press, 4th edition, December 27 2012. ISBN-10: 1421407949, ISBN-13: 978-1421407944.Google ScholarGoogle Scholar
  25. L. A. B. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka. Distributed Diskless Checkpoint for Large Scale Systems. In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pages 63--72, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. R. Granat, B. Kågström, and D. Kressner. A novel parallel QR algorithm for hybrid distributed memory HPC systems. SIAM J. Sci. Comput., 32:2345--2378, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. R. Granat, B. Kågström, D. Kressner, and M. Shao. Parallel library software for the multishift QR algorithm with aggressive early deflation. Technical Report UMINF-12.06, Dept. of Computing Science, Umeøa University, Sweden, 2012.Google ScholarGoogle Scholar
  28. D. Hakkarinen and Z. Chen. Algorithmic Cholesky Factorization Fault Recovery. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1--10, April 2010.Google ScholarGoogle ScholarCross RefCross Ref
  29. K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., 33(6):518--528, June 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. R. M. K. Braman, R. Byers. The multishift QR algorithm. I. maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23:929--947, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. B. Kågström, D. Kressner, and M. Shao. On aggressive early deflation in parallel variants of the QR algorithm. In PARA 2010, Applied Parallel and Scientific Computing, LNCS, volume 7134, pages 1--10. Springer, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Y. Kim, J. S. Plank, and J. Dongarra. Fault Tolerant Matrix Operations Using Checksum and Reverse Computation. In 6th Symposium on the Frontiers of Massively Parallel Computation, pages 70--77, Annapolis, MD, October 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. Recovery patterns for iterative methods in a parallel unstable environment. SIAM J. Sci. Comput., 30(1):102--116, November 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Langville and C. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C.-D. Lu. Scalable Diskless Checkpointing for Large Parallel Systems. PhD thesis, University of Illinois, Urbana, Illinois, USA, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. F. T. Luk and H. Park. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Trans. Comput., 37(11):1434--1438, November 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. E. Meneses. Clustering Parallel Applications to Enhance Message Logging Protocol. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-emenese.pdf?version=1\&modificationDate=1290466786000.Google ScholarGoogle Scholar
  38. A. Petitet, C. Whaley, J. Dongarra, and A. Cleary. HPL - a Portable Implementation of the High-Performance Linpack Benchmark for Distributed-memory Computers, September 2008. http://www.netlib.org/benchmark/hpl/.Google ScholarGoogle Scholar
  39. J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Trans. Parallel Distrib. Syst., 9(10):972--986, October 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. R. Schreiber and C. V. Loan. A storage efficient WY representation for products of householder transformations. SIAM Journal on Scientific and Statistical Computing, 10, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. B. Schroeder and G. A. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78, 2007.Google ScholarGoogle Scholar
  42. G. W. Stewart. Matrix Algorithms, Volume II: Eigensystems. SIAM: Society for Industrial and Applied Mathematics, First edition, August 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. U. von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17:395--416, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA, 1988. ISBN-10: 0198534183, ISBN-13: 978-0198534181. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. J. H. Wilkinson. Rounding Errors in Algebraic Processes. Dover, New York, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. E. Yao, R. Wang, M. Chen, G. Tan, and N. Sun. A Case Study of Designing Efficient Algorithm-based Fault Tolerant Application for Exascale Parallelism. In Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 438--448, May 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Parallel reduction to hessenberg form with algorithm-based fault tolerance

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SC '13: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis
      November 2013
      1123 pages
      ISBN:9781450323789
      DOI:10.1145/2503210
      • General Chair:
      • William Gropp,
      • Program Chair:
      • Satoshi Matsuoka

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 17 November 2013

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SC '13 Paper Acceptance Rate91of449submissions,20%Overall Acceptance Rate1,516of6,373submissions,24%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader