ABSTRACT
Developers spend a lot of time searching for the root causes of software failures. For this, they traditionally try to reproduce those failures, but unfortunately many failures are so hard to reproduce in a test environment that developers spend days or weeks as ad-hoc detectives. The shortcomings of many solutions proposed for this problem prevent their use in practice.
We propose failure sketching, an automated debugging technique that provides developers with an explanation ("failure sketch") of the root cause of a failure that occurred in production. A failure sketch only contains program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause. Our approach combines static program analysis with a cooperative and adaptive form of dynamic program analysis.
We built Gist, a prototype for failure sketching that relies on hardware watchpoints and a new hardware feature for extracting control flow traces (Intel Processor Trace). We show that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached.
Supplemental Material
- Abreu, R., Zoeteweij, P., and Gemund, A. J. C. V. An evaluation of similarity coefficients for software fault localization. In PRDC (2006). Google ScholarDigital Library
- Altekar, G., and Stoica, I. ODR: Output-deterministic replay for multicore programs. In Symp. on Operating Systems Principles (2009). Google ScholarDigital Library
- Ammons, G., and Larus, J. R. Improving data-flow analysis with path profiles. In Intl. Conf. on Programming Language Design and Implem. (1994). Google ScholarDigital Library
- Arulraj, J., Chang, P.-C., Jin, G., and Lu, S. Production-run software failure diagnosis via hardware performance counters. In ASPLOS (2013). Google ScholarDigital Library
- Arulraj, J., Jin, G., and Lu, S. Leveraging the short-term memory of hardware to diagnose production-run software failures. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2014). Google ScholarDigital Library
- Arumuga Nainar, P., and Liblit, B. Adaptive bug isolation. In Intl. Conf. on Software Engineering (2010). Google ScholarDigital Library
- Baris Kasikci, Benjamin Schubert, G. C. Gist. http://dslab.epfl.ch/proj/gist/, 2015.Google Scholar
- Beeman Strong. Debug and fine-grain profiling with intel processor trace. http://bit.ly/1xMYbIC, 2014.Google Scholar
- Bruening, D., Garnett, T., and Amarasinghe, S. An infrastructure for adaptive dynamic optimization. In CGO (2003). Google ScholarDigital Library
- Butenhof, D. R. Programming with POSIX Threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997. Google ScholarDigital Library
- Chen, H., Yu, J., Chen, R., Zang, B., and Yew, P.-C. Polus: A powerful live updating system. In ICSE (2007). Google ScholarDigital Library
- Chilimbi, T. M., Liblit, B., Mehra, K., Nori, A. V., and Vaswani, K. HOLMES: Effective statistical debugging via efficient path profiling. In Intl. Conf. on Software Engineering (2009). Google ScholarDigital Library
- Choi, J.-D., and Zeller, A. Isolating failure-inducing thread schedules. In ISSTA (2002). Google ScholarDigital Library
- Dunlap, G. W., Lucchetti, D., Chen, P. M., and Fetterman, M. Execution replay on multiprocessor virtual machines. In Intl. Conf. on Virtual Execution Environments (2008). Google ScholarDigital Library
- Fitzpatrick, B. Memcached. http://memcached.org, 2013.Google Scholar
- Gilchrist, J. Parallel BZIP2. http://compression.ca/pbzip2, 2015.Google Scholar
- Glerum, K., Kinshumann, K., Greenberg, S., Aul, G., Orgovan, V., Nichols, G., Grant, D., Loihle, G., and Hunt, G. Debugging in the (very) large: ten years of implementation and experience. In Symp. on Operating Systems Principles (2009). Google ScholarDigital Library
- Godefroid, P., and Nagappan, N. Concurrency at Microsoft -- An exploratory survey. In Intl. Conf. on Computer Aided Verification (2008).Google Scholar
- Hauswirth, M., and Chilimbi, T. M. Low-overhead memory leak detection using adaptive statistical profiling. In ASPLOS (2004). Google ScholarDigital Library
- Hower, D. R., and Hill, M. D. Rerun: Exploiting episodes for lightweight memory race recording. ISCA. Google ScholarDigital Library
- Apache httpd. http://httpd.apache.org, 2013.Google Scholar
- Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, vol. 2. 2015.Google Scholar
- Intel Corporation. Intel processor trace. https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing, 2013.Google Scholar
- Linux branch with intel pt support. https://github.com/virtuoso/linux-perf/tree/intel_pt, 2015.Google Scholar
- Jin, G., Thakur, A., Liblit, B., and Lu, S. Instrumentation and sampling strategies for cooperative concurrency bug isolation. SIGPLAN Not. (2010). Google ScholarDigital Library
- Jin, G., Zhang, W., Deng, D., Liblit, B., and Lu, S. Automated concurrency-bug fixing. In OSDI (2012). Google ScholarDigital Library
- John Criswell. The information flow compiler. https://llvm.org/svn/llvm-project/giri/, 2011.Google Scholar
- Jones, J. A., and Harrold, M. J. Empirical evaluation of the tarantula automatic fault-localization technique. In ASE (2005). Google ScholarDigital Library
- Kasikci, B., Ball, T., Candea, G., Erickson, J., and Musuvathi, M. Efficient tracing of cold code via bias-free sampling. In USENIX Annual Technical Conf. (2014). Google ScholarDigital Library
- Kasikci, B., Pereira, C., Pokam, G., Schubert, B., Musuvathi, M., and Candea, G. Failure sketches: A better way to debug. In Workshop on Hot Topics in Operating Systems (2015). Google ScholarDigital Library
- Kasikci, B., Zamfir, C., and Candea, G. Data races vs. data race bugs: Telling the difference with Portend. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2012). Google ScholarDigital Library
- Kasikci, B., Zamfir, C., and Candea, G. Race-Mob: Crowdsourced data race detection. In Symp. on Operating Systems Principles (2013). Google ScholarDigital Library
- Kendall, M. G. A new measure of rank correlation. Biometrika (1938).Google Scholar
- Kleen, A. simple-pt linux driver. https://github.com/andikleen/simple-pt, 2015.Google Scholar
- Lattner, C. Macroscopic Data Structure Analysis and Optimization. PhD thesis, University of Illinois at Urbana-Champaign, May 2005. Google ScholarDigital Library
- Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis and transformation. In Intl. Symp. on Code Generation and Optimization (2004). Google ScholarDigital Library
- Liblit, B., Naik, M., Zheng, A. X., Aiken, A., and Jordan, M. I. Scalable statistical bug isolation. In Intl. Conf. on Programming Language Design and Implem. (2005). Google ScholarDigital Library
- Liblit, B. R. Cooperative Bug Isolation. PhD thesis, University of California, Berkeley, Dec. 2004. Google ScholarDigital Library
- Lu, S., Park, S., Seo, E., and Zhou, Y. Learning from mistakes -- A comprehensive study on real world concurrency bug characteristics. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2008). Google ScholarDigital Library
- Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. PIN: building customized program analysis tools with dynamic instrumentation. In Intl. Conf. on Programming Language Design and Implem. (2005). Google ScholarDigital Library
- Machado, N., Lucia, B., and Rodrigues, L. Concurrency debugging with differential schedule projections. In PLDI (2015). Google ScholarDigital Library
- Manevich, R., Sridharan, M., Adams, S., Das, M., and Yang, Z. PSE: explaining program failures via postmortem static analysis. In Symp. on the Foundations of Software Eng. (2004). Google ScholarDigital Library
- Marjamãdki, D. Cppcheck. http://cppcheck.sourceforge.net/, 2015.Google Scholar
- McConnell, S. Code Complete. Microsoft Press, 2004.Google Scholar
- Miller, B. P., Callaghan, M. D., Cargille, J. M., Hollingsworth, J. K., Irvin, R. B., Karavanic, K. L., Kunchithapadam, K., and Newhall, T. The paradyn parallel performance measurement tool. Computer (1995). Google ScholarDigital Library
- Montesinos, P., Ceze, L., and Torrellas, J. Delorean: Recording and deterministically replaying shared-memory multiprocessor execution efficiently. ISCA (2008). Google ScholarDigital Library
- Musuvathi, M., Qadeer, S., Ball, T., Basler, G., Nainar, P. A., and Neamtiu, I. Finding and reproducing Heisenbugs in concurrent programs. In Symp. on Operating Sys. Design and Implem. (2008). Google ScholarDigital Library
- Novark, G., Berger, E. D., and Zorn, B. G. Exterminator: Automatically correcting memory errors with high probability. In Intl. Conf. on Programming Language Design and Implem. (2007). Google ScholarDigital Library
- Papamarcos, M. S., and Patel, J. H. A low-overhead coherence solution for multiprocessors with private cache memories. In ISCA (1984). Google ScholarDigital Library
- Park, S., Xiong, W., Yin, Z., Kaushik, R., Lee, K. H., Lu, S., and Zhou, Y. PRES: Probabilistic replay with execution sketching on multiprocessors. In Symp. on Operating Systems Principles (2009). Google ScholarDigital Library
- Perkins, J. H., Kim, S., Larsen, S., Amarasinghe, S., Bachrach, J., Carbin, M., Pacheco, C., Sherwood, F., Sidiroglou, S., Sullivan, G., Wong, W.-F., Zibin, Y., Ernst, M. D., and Rinard, M. Automatically patching errors in deployed software. In Symp. on Operating Sys. Design and Implem. (2010). Google ScholarDigital Library
- Pokam, G., Pereira, C., Hu, S., Adl-Tabatabai, A.-R., Gottschlich, J., Ha, J., and Wu, Y. Coreracer: A practical memory race recorder for multicore x86 tso processors. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (2011), MICRO-44, ACM, pp. 216--225. Google ScholarDigital Library
- Qin, F., Tucek, J., Zhou, Y., and Sundaresan, J. Rx: Treating bugs as allergies -- a safe method to survive software failures. ACM Transactions on Computer Systems 25, 3 (2007). Google ScholarDigital Library
- Quora. What is a coder's worst nightmware? http://www.quora.com/What-is-a-coders-worst-nightmare.Google Scholar
- Rastislav Bodik, S. A. Path-sensitive value-flow analysis. In Symp. on Principles of Programming Languages (1998). Google ScholarDigital Library
- Rijsbergen, C. J. V. Information Retrieval. Butterworth-Heinemann, 1979. Google ScholarDigital Library
- Sadowski, C., and Yi, J. How developers use data race detection tools. In Proceedings of the 5th Workshop on Evaluation and Usability of Programming Languages and Tools (2014), PLATEAU. Google ScholarDigital Library
- Sahoo, S. K., Criswell, J., Geigle, C., and Adve, V. Using likely invariants for automated software fault localization. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2013). Google ScholarDigital Library
- Slaby, J. Llvm slicer. https://github.com/jirislaby/LLVMSlicer/, 2014.Google Scholar
- SQLite. http://www.sqlite.org/, 2013.Google Scholar
- Stenberg, D. Curl bug 965. http://sourceforge.net/p/curl/bugs/965/, 2013.Google Scholar
- Stenberg, D. Curl. http://curl.haxx.se/, 2015.Google Scholar
- Stoddard, B. Apache bug 21287. https://bz.apache.org/bugzilla/show_bug.cgi?id=21287, 2003.Google Scholar
- Sweeney, L. K-Anonymity: A model for protecting privacy. In Intl. Journal on Uncertainty, Fuzziness and Knowledge-based Systems (2002). Google ScholarDigital Library
- The Associated Press. Northeastern blackout bug. http://www.securityfocus.com/news/8032, 2004.Google Scholar
- Transmission. http://www.transmissionbt.com/, 2015.Google Scholar
- Tucek, J., Lu, S., Huang, C., Xanthos, S., and Zhou, Y. Triage: diagnosing production run failures at the user's site. In Symp. on Operating Systems Principles (2007). Google ScholarDigital Library
- Veeraraghavan, K., Lee, D., Wester, B., Ouyang, J., Chen, P. M., Flinn, J., and Narayanasamy, S. Doubleplay: Parallelizing sequential logging and replay. TOCS 30, 1 (2012). Google ScholarDigital Library
- Wang, Y., Patil, H., Pereira, C., Lueck, G., Gupta, R., and Neamtiu, I. Drdebug: Deterministic replay based cyclic debugging with dynamic slicing. In CGO (2014). Google ScholarDigital Library
- Weidendorfer, J. Kcachegrind. http://kcachegrind.sourceforge.net/html/Home.html, 2015.Google Scholar
- Weining Gu, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Zhen-Yu Yang. Characterization of linux kernel behavior under errors, 2003.Google ScholarCross Ref
- Weiser, M. Program slicing. In Intl. Conf. on Software Engineering (1981). Google ScholarDigital Library
- Wheeler, D. SLOCCount. http://www.dwheeler.com/sloccount/, 2010.Google Scholar
- Wilson, P. F., Dell, L. D., and Anderson, G. F. Root Cause Analysis : A Tool for Total Quality Management. American Society for Quality, 1993.Google Scholar
- Wu, J., Cui, H., and Yang, J. Bypassing races in live applications with execution filters. In Symp. on Operating Sys. Design and Implem. (2010). Google ScholarDigital Library
- Xin, B., Sumner, W. N., and Zhang, X. Efficient program execution indexing. In Intl. Conf. on Programming Language Design and Implem. (2008). Google ScholarDigital Library
- Yu, J., and Narayanasamy, S. A case for an interleaving constrained shared-memory multi-processor. In Intl. Symp. on Computer Architecture (2009). Google ScholarDigital Library
- Yu, Y., Rodeheffer, T., and Chen, W. Racetrack: Efficient detection of data race conditions via adaptive tracking. In Symp. on Operating Systems Principles (2005). Google ScholarDigital Library
- Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., and Pasupathy, S. SherLog: error diagnosis by connecting clues from run-time logs. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2010). Google ScholarDigital Library
- Zamfir, C., Altekar, G., Candea, G., and Stoica, I. Debug determinism: The sweet spot for replay-based debugging. In Workshop on Hot Topics in Operating Systems (2011). Google ScholarDigital Library
- Zeller, A., and Hildebrandt, R. Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering (2002). Google ScholarDigital Library
- Zhang, W., Lim, J., Olichandran, R., Scherpelz, J., Jin, G., Lu, S., and Reps, T. ConSeq: Detecting concurrency bugs through sequential errors. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2011). Google ScholarDigital Library
Index Terms
- Failure sketching: a technique for automated root cause diagnosis of in-production failures
Recommendations
The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure
SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems PrinciplesThe end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a ...
Lazy Diagnosis of In-Production Concurrency Bugs
SOSP '17: Proceedings of the 26th Symposium on Operating Systems PrinciplesDiagnosing concurrency bugs---the process of understanding the root causes of concurrency failures---is hard. Developers depend on reproducing concurrency bugs to diagnose them. Traditionally, systems that attempt to reproduce concurrency bugs record ...
Pivot tracing: dynamic causal monitoring for distributed systems
SOSP '15: Proceedings of the 25th Symposium on Operating Systems PrinciplesMonitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today -- logs, counters, and metrics -- have two important ...
Comments