ABSTRACT
This paper studies the resilience of a two-sided factorization and presents a generic algorithm-based approach capable of making two-sided factorizations resilient. We establish the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction (HR) and present the scalability and performance results of a practical implementation. Our method is a hybrid algorithm combining an Algorithm Based Fault Tolerance (ABFT) technique with diskless checkpointing to fully protect the data. We protect the trailing and the initial part of the matrix with checksums, and protect finished panels in the panel scope with diskless checkpoints. Compared with the original HR (the ScaLAPACK PDGEHRD routine) our fault-tolerant algorithm introduces very little overhead, and maintains the same level of scalability. We prove that the overhead shows a decreasing trend as the size of the matrix or the size of the process grid increases.
- E. Anderson, Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen. LAPACK Users' Guide. SIAM, Philadelphia, PA, Third edition, 1999. Google ScholarDigital Library
- M. Berry and M. Browne. Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, Philadelphia, Second edition, 2005. Google ScholarDigital Library
- C. H. Bischof and C. V. Loan. The WY Representation for Products of Householder Matrices. In Parallel Processing for Scientific Computing, pages 2--13, 1985. Google ScholarDigital Library
- L. S. Blackford, J. Choi, A. Cleary, E. D'Azeuedo, J. Demmel, I. Dhillon, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, Philadelphia, PA, 1997. Google ScholarDigital Library
- W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. A checkpoint-on-failure protocol for algorithm-based recovery in standard MPI. In Proceedings of the 18th international conference on Parallel Processing, Euro-Par'12, pages 477--488, 2012. Google ScholarDigital Library
- W. Bland, P. Du, A. Bouteiller, T. Herault, G. Bosilca, and J. Dongarra. Extending the scope of the checkpoint-on-failure protocol for forward recovery in standard MPI. Technical Report UT-CS-12-702, University of Tennessee, 2012.Google Scholar
- D. Boley, G. H. Golub, S. Makar, N. Saxena, and E. J. McCluskey. Floating point fault tolerance with backward error assertions. IEEE Transactions on Computers, 44(2):302--311, February 1995. Google ScholarDigital Library
- G. Bosilca, A. Bouteiller, É. Brunet, F. Cappello, J. Dongarra, A. Guermouche, T. Hérault, Y. Robert, F. Vivien, and D. Zaidouni. Unified Model for Assessing Checkpointing Protocols at Extreme-Scale. Technical Report RR-7950, INRIA, October 2012.Google Scholar
- G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. Algorithm-based fault tolerance applied to high performance computing. J. Parallel Distrib. Comput., 69(4):410--416, April 2009. Google ScholarDigital Library
- M. Bougeret, H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Using Group Replication for Resilience on Exascale Systems. Technical Report RR-7876, INRIA, February 2012.Google Scholar
- K. Braman, R. Byers, and R. Mathias. The multishift QR algorithm. ii. aggressive early deflation. SIAM J. Matrix Anal. Appl., 23:948--973, 2002. Google ScholarDigital Library
- S. Brin and L. Page. The antaomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 33:107--17, 1998. Also available online at http://infolab.stanford.edu/pub/papers/google.pdf. Google ScholarDigital Library
- K. Bryan and T. Leise. The $25,000,000,000 eigenvector. the linear algebra behind google. SIAM Review, 48(3):569--81, 2006. Also avaiable at http://www.rose-hulman.edu/~bryan/google.html. Google ScholarDigital Library
- H. Casanova, Y. Robert, F. Vivien, and D. Zaidouni. Combining Process Replication and Checkpointing for Resilience on Exascale Systems. Technical Report RR-7951, INRIA, May 2012.Google Scholar
- A. M. Castaldo, R. C. Whaley, and A. T. Chronopoulos. Reducing floating point error in dot product using the superblock family of algorithms. SIAM J. Sci. Comput., 31(2):1156--1174, 2008. Google ScholarDigital Library
- Z. Chen. Scalable techniques for fault tolerant high performance computing. PhD thesis, University of Tennessee, Knoxville, TN, USA, 2006. Google ScholarDigital Library
- T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen. High Performance Linpack Benchmark: A Fault Tolerant Implementation Without Checkpointing. In Proceedings of the International Conference on Supercomputing, ICS '11, pages 162--171, New York, NY, USA, 2011. ACM. Google ScholarDigital Library
- J. Dongarra, P. Beckman, and T. Moore. The international Exascale software project roadmap. Int. J. High Perform. Comput. Appl., 25(1):3--60, February 2011. Google ScholarDigital Library
- J. J. Dongarra, P. Luszczek, and A. Petitet. The LINPACK benchmark: Past, present, and future. Concurrency and Computation: Practice and Experience, 15:1--18, 2003.Google ScholarCross Ref
- J. J. Dongarra and R. A. van de Geijn. Reduction to condensed form for the eigenvalue problem on distributed memory architectures. Parallel Computing, 18(9):973--982, 1992.Google ScholarCross Ref
- P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. SIGPLAN Not., 47(8):225--234, February 2012. Google ScholarDigital Library
- J. G. F. Francis. The QR transformation a unitary analogue to the LR transformation. I. Comput. J., 4:265--271, 1961.Google ScholarCross Ref
- J. G. F. Francis. The QR transformation II. Comput. J., 4:332--345, 1962.Google ScholarCross Ref
- G. H. Golub and C. F. V. Loan. Matrix Computations. The John Hopkins University Press, 4th edition, December 27 2012. ISBN-10: 1421407949, ISBN-13: 978-1421407944.Google Scholar
- L. A. B. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka. Distributed Diskless Checkpoint for Large Scale Systems. In Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pages 63--72, 2010. Google ScholarDigital Library
- R. Granat, B. Kågström, and D. Kressner. A novel parallel QR algorithm for hybrid distributed memory HPC systems. SIAM J. Sci. Comput., 32:2345--2378, 2010. Google ScholarDigital Library
- R. Granat, B. Kågström, D. Kressner, and M. Shao. Parallel library software for the multishift QR algorithm with aggressive early deflation. Technical Report UMINF-12.06, Dept. of Computing Science, Umeøa University, Sweden, 2012.Google Scholar
- D. Hakkarinen and Z. Chen. Algorithmic Cholesky Factorization Fault Recovery. In Parallel Distributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1--10, April 2010.Google ScholarCross Ref
- K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., 33(6):518--528, June 1984. Google ScholarDigital Library
- R. M. K. Braman, R. Byers. The multishift QR algorithm. I. maintaining well-focused shifts and level 3 performance. SIAM J. Matrix Anal. Appl., 23:929--947, 2002. Google ScholarDigital Library
- B. Kågström, D. Kressner, and M. Shao. On aggressive early deflation in parallel variants of the QR algorithm. In PARA 2010, Applied Parallel and Scientific Computing, LNCS, volume 7134, pages 1--10. Springer, 2012. Google ScholarDigital Library
- Y. Kim, J. S. Plank, and J. Dongarra. Fault Tolerant Matrix Operations Using Checksum and Reverse Computation. In 6th Symposium on the Frontiers of Massively Parallel Computation, pages 70--77, Annapolis, MD, October 1996. Google ScholarDigital Library
- J. Langou, Z. Chen, G. Bosilca, and J. Dongarra. Recovery patterns for iterative methods in a parallel unstable environment. SIAM J. Sci. Comput., 30(1):102--116, November 2007. Google ScholarDigital Library
- A. Langville and C. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, 2006. Google ScholarDigital Library
- C.-D. Lu. Scalable Diskless Checkpointing for Large Parallel Systems. PhD thesis, University of Illinois, Urbana, Illinois, USA, 2005. Google ScholarDigital Library
- F. T. Luk and H. Park. Fault-tolerant matrix triangularizations on systolic arrays. IEEE Trans. Comput., 37(11):1434--1438, November 1988. Google ScholarDigital Library
- E. Meneses. Clustering Parallel Applications to Enhance Message Logging Protocol. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-emenese.pdf?version=1\&modificationDate=1290466786000.Google Scholar
- A. Petitet, C. Whaley, J. Dongarra, and A. Cleary. HPL - a Portable Implementation of the High-Performance Linpack Benchmark for Distributed-memory Computers, September 2008. http://www.netlib.org/benchmark/hpl/.Google Scholar
- J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Trans. Parallel Distrib. Syst., 9(10):972--986, October 1998. Google ScholarDigital Library
- R. Schreiber and C. V. Loan. A storage efficient WY representation for products of householder transformations. SIAM Journal on Scientific and Statistical Computing, 10, 1989. Google ScholarDigital Library
- B. Schroeder and G. A. Gibson. Understanding failures in petascale computers. Journal of Physics: Conference Series, 78, 2007.Google Scholar
- G. W. Stewart. Matrix Algorithms, Volume II: Eigensystems. SIAM: Society for Industrial and Applied Mathematics, First edition, August 2001. Google ScholarDigital Library
- U. von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17:395--416, 2007. Google ScholarDigital Library
- J. H. Wilkinson. The Algebraic Eigenvalue Problem. Oxford University Press, Inc., New York, NY, USA, 1988. ISBN-10: 0198534183, ISBN-13: 978-0198534181. Google ScholarDigital Library
- J. H. Wilkinson. Rounding Errors in Algebraic Processes. Dover, New York, 1994. Google ScholarDigital Library
- E. Yao, R. Wang, M. Chen, G. Tan, and N. Sun. A Case Study of Designing Efficient Algorithm-based Fault Tolerant Application for Exascale Parallelism. In Parallel Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International, pages 438--448, May 2012. Google ScholarDigital Library
Index Terms
- Parallel reduction to hessenberg form with algorithm-based fault tolerance
Recommendations
Orthogonal Hessenberg Reduction and Orthogonal Krylov Subspace Bases
We study necessary and sufficient conditions that a nonsingular matrix A can be B-orthogonally reduced to upper Hessenberg form with small bandwidth. By this we mean the existence of a decomposition AV=VH, where H is upper Hessenberg with few nonzero ...
Accelerating the reduction to upper Hessenberg, tridiagonal, and bidiagonal forms through hybrid GPU-based computing
We present a Hessenberg reduction (HR) algorithm for hybrid systems of homogeneous multicore with GPU accelerators that can exceed 25x the performance of the corresponding LAPACK algorithm running on current homogeneous multicores. This enormous ...
FT-ScaLAPACK: correcting soft errors on-line for ScaLAPACK cholesky, QR, and LU factorization routines
HPDC '14: Proceedings of the 23rd international symposium on High-performance parallel and distributed computingIt is well known that soft errors in linear algebra operations can be detected off-line at the end of the computation using algorithm-based fault tolerance (ABFT). However, traditional ABFT usually cannot correct errors in Cholesky, QR, and LU ...
Comments