Debugging high-performance computing applications at massive scales

Published: 24 August 2015

Abstract

Dynamic analysis techniques help programmers find the root cause of bugs in large-scale parallel applications.

Published in

Communications of the ACM, Volume 58, Issue 9
September 2015
119 pages
ISSN: 0001-0782
EISSN: 1557-7317
DOI: 10.1145/2817191
Editor: Moshe Y. Vardi

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States