Abstract
Dynamic analysis techniques help programmers find the root cause of bugs in large-scale parallel applications.
Index Terms
- Debugging high-performance computing applications at massive scales