DOI: 10.1145/2815400.2815412
Research article · SOSP Conference Proceedings

Failure sketching: a technique for automated root cause diagnosis of in-production failures

Published: 04 October 2015

ABSTRACT

Developers spend a lot of time searching for the root causes of software failures. To do so, they traditionally try to reproduce those failures, but many failures are so hard to reproduce in a test environment that developers end up spending days or weeks as ad-hoc detectives. Many solutions proposed for this problem have shortcomings that prevent their use in practice.

We propose failure sketching, an automated debugging technique that provides developers with an explanation ("failure sketch") of the root cause of a failure that occurred in production. A failure sketch only contains program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause. Our approach combines static program analysis with a cooperative and adaptive form of dynamic program analysis.
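To make "differences between failing and successful runs" concrete, here is a minimal, illustrative sketch in C, not the paper's algorithm: each run is reduced to the set of source statements it executed, and statements seen only in the failing run are reported as candidates for the failure sketch. A real failure sketch also captures data values, control flow, and the ordering of events across threads; the statement names (queue.c:88, worker.c:152, and so on) are invented for the example.

```c
/* Illustrative only: report statements executed by a failing run
 * but by no successful run. Statement names are hypothetical. */
#include <stdio.h>
#include <string.h>

static const char *failing_run[] = {
    "queue.c:88", "queue.c:91", "worker.c:140", "worker.c:152", NULL
};
static const char *successful_run[] = {
    "queue.c:88", "queue.c:91", "worker.c:140", NULL
};

/* Returns 1 if stmt appears in the NULL-terminated run trace. */
static int seen_in(const char *stmt, const char **run)
{
    for (int i = 0; run[i] != NULL; i++)
        if (strcmp(stmt, run[i]) == 0)
            return 1;
    return 0;
}

int main(void)
{
    puts("candidate root-cause statements (failing run only):");
    for (int i = 0; failing_run[i] != NULL; i++)
        if (!seen_in(failing_run[i], successful_run))
            printf("  %s\n", failing_run[i]);
    return 0;
}
```

Run as-is, this prints worker.c:152, the one statement that only the failing run executed; an actual failure sketch additionally shows how values and ordering at such statements differ across runs.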

We built Gist, a prototype for failure sketching that relies on hardware watchpoints and a new hardware feature for extracting control flow traces (Intel Processor Trace). We show that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached.
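Gist's implementation is not shown in this abstract, but the hardware-watchpoint mechanism it builds on can be sketched with the standard Linux perf_event_open(2) breakpoint interface. The snippet below is a sketch only, not Gist's code: it arms a data watchpoint on the address of a suspect variable so that every access is reported to a monitor. The names watch_address and counter are hypothetical, and reading the resulting samples from the perf ring buffer is omitted.

```c
/* Sketch: arm a hardware data watchpoint on one variable using the
 * Linux perf_event_open(2) breakpoint interface. Not Gist's code. */
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>
#include <linux/hw_breakpoint.h>

static int watch_address(pid_t pid, void *addr)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_BREAKPOINT;
    attr.size           = sizeof(attr);
    attr.bp_type        = HW_BREAKPOINT_RW;     /* trap on reads and writes */
    attr.bp_addr        = (unsigned long)addr;  /* suspect variable's address */
    attr.bp_len         = HW_BREAKPOINT_LEN_4;  /* watch 4 bytes */
    attr.sample_period  = 1;                    /* report every access */
    attr.exclude_kernel = 1;

    /* pid 0 = calling process, cpu -1 = any CPU, no group, no flags. */
    return (int)syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
}

int counter;  /* stand-in for a variable suspected to be the root cause */

int main(void)
{
    int fd = watch_address(0, &counter);
    if (fd < 0) {
        perror("perf_event_open");
        return 1;
    }
    counter++;   /* this access now generates a perf sample that a
                    monitoring process could consume */
    close(fd);
    return 0;
}
```

In the paper's setting, the hard part is deciding which few addresses to watch and when; that is where the combination of static analysis (to narrow down candidate statements) with cooperative, adaptive dynamic tracking described above comes in, since only a handful of hardware watchpoints are available per core.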


Supplemental Material

p344.mp4 (MP4 video, 2.1 GB)


Published in

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
October 2015, 499 pages
ISBN: 9781450338349
DOI: 10.1145/2815400
Copyright © 2015 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

SOSP '15 paper acceptance rate: 30 of 181 submissions (17%). Overall SOSP acceptance rate: 131 of 716 submissions (18%).
