research-article

Failure sketching: a technique for automated root cause diagnosis of in-production failures

Authors:
Baris Kasikci

Ecole Polytechnique Fédérale de Lausanne (EPFL)

Ecole Polytechnique Fédérale de Lausanne (EPFL)
View Profile

,
Benjamin Schubert

Ecole Polytechnique Fédérale de Lausanne (EPFL)

Ecole Polytechnique Fédérale de Lausanne (EPFL)
View Profile

,
Cristiano Pereira

Intel Corporation

Intel Corporation
View Profile

,
Gilles Pokam

Intel Corporation

Intel Corporation
View Profile

,
George Candea

Ecole Polytechnique Fédérale de Lausanne (EPFL)

Ecole Polytechnique Fédérale de Lausanne (EPFL)
View Profile

SOSP '15: Proceedings of the 25th Symposium on Operating Systems PrinciplesOctober 2015Pages 344–360https://doi.org/10.1145/2815400.2815412

Published:04 October 2015Publication History

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

Pages 344–360

ABSTRACT

Developers spend a lot of time searching for the root causes of software failures. For this, they traditionally try to reproduce those failures, but unfortunately many failures are so hard to reproduce in a test environment that developers spend days or weeks as ad-hoc detectives. The shortcomings of many solutions proposed for this problem prevent their use in practice.

We propose failure sketching, an automated debugging technique that provides developers with an explanation ("failure sketch") of the root cause of a failure that occurred in production. A failure sketch only contains program statements that lead to the failure, and it clearly shows the differences between failing and successful runs; these differences guide developers to the root cause. Our approach combines static program analysis with a cooperative and adaptive form of dynamic program analysis.

We built Gist, a prototype for failure sketching that relies on hardware watchpoints and a new hardware feature for extracting control flow traces (Intel Processor Trace). We show that Gist can build failure sketches with low overhead for failures in systems like Apache, SQLite, and Memcached.

Supplemental Material

p344.mp4

mp4

2.1 GB

Download

References

Abreu, R., Zoeteweij, P., and Gemund, A. J. C. V. An evaluation of similarity coefficients for software fault localization. In PRDC (2006). Google ScholarDigital Library
Altekar, G., and Stoica, I. ODR: Output-deterministic replay for multicore programs. In Symp. on Operating Systems Principles (2009). Google ScholarDigital Library
Ammons, G., and Larus, J. R. Improving data-flow analysis with path profiles. In Intl. Conf. on Programming Language Design and Implem. (1994). Google ScholarDigital Library
Arulraj, J., Chang, P.-C., Jin, G., and Lu, S. Production-run software failure diagnosis via hardware performance counters. In ASPLOS (2013). Google ScholarDigital Library
Arulraj, J., Jin, G., and Lu, S. Leveraging the short-term memory of hardware to diagnose production-run software failures. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2014). Google ScholarDigital Library
Arumuga Nainar, P., and Liblit, B. Adaptive bug isolation. In Intl. Conf. on Software Engineering (2010). Google ScholarDigital Library
Baris Kasikci, Benjamin Schubert, G. C. Gist. http://dslab.epfl.ch/proj/gist/, 2015.Google Scholar
Beeman Strong. Debug and fine-grain profiling with intel processor trace. http://bit.ly/1xMYbIC, 2014.Google Scholar
Bruening, D., Garnett, T., and Amarasinghe, S. An infrastructure for adaptive dynamic optimization. In CGO (2003). Google ScholarDigital Library
Butenhof, D. R. Programming with POSIX Threads. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1997. Google ScholarDigital Library
Chen, H., Yu, J., Chen, R., Zang, B., and Yew, P.-C. Polus: A powerful live updating system. In ICSE (2007). Google ScholarDigital Library
Chilimbi, T. M., Liblit, B., Mehra, K., Nori, A. V., and Vaswani, K. HOLMES: Effective statistical debugging via efficient path profiling. In Intl. Conf. on Software Engineering (2009). Google ScholarDigital Library
Choi, J.-D., and Zeller, A. Isolating failure-inducing thread schedules. In ISSTA (2002). Google ScholarDigital Library
Dunlap, G. W., Lucchetti, D., Chen, P. M., and Fetterman, M. Execution replay on multiprocessor virtual machines. In Intl. Conf. on Virtual Execution Environments (2008). Google ScholarDigital Library
Fitzpatrick, B. Memcached. http://memcached.org, 2013.Google Scholar
Gilchrist, J. Parallel BZIP2. http://compression.ca/pbzip2, 2015.Google Scholar
Glerum, K., Kinshumann, K., Greenberg, S., Aul, G., Orgovan, V., Nichols, G., Grant, D., Loihle, G., and Hunt, G. Debugging in the (very) large: ten years of implementation and experience. In Symp. on Operating Systems Principles (2009). Google ScholarDigital Library
Godefroid, P., and Nagappan, N. Concurrency at Microsoft -- An exploratory survey. In Intl. Conf. on Computer Aided Verification (2008).Google Scholar
Hauswirth, M., and Chilimbi, T. M. Low-overhead memory leak detection using adaptive statistical profiling. In ASPLOS (2004). Google ScholarDigital Library
Hower, D. R., and Hill, M. D. Rerun: Exploiting episodes for lightweight memory race recording. ISCA. Google ScholarDigital Library
Apache httpd. http://httpd.apache.org, 2013.Google Scholar
Intel. Intel 64 and IA-32 Architectures Software Developer's Manual, vol. 2. 2015.Google Scholar
Intel Corporation. Intel processor trace. https://software.intel.com/en-us/blogs/2013/09/18/processor-tracing, 2013.Google Scholar
Linux branch with intel pt support. https://github.com/virtuoso/linux-perf/tree/intel_pt, 2015.Google Scholar
Jin, G., Thakur, A., Liblit, B., and Lu, S. Instrumentation and sampling strategies for cooperative concurrency bug isolation. SIGPLAN Not. (2010). Google ScholarDigital Library
Jin, G., Zhang, W., Deng, D., Liblit, B., and Lu, S. Automated concurrency-bug fixing. In OSDI (2012). Google ScholarDigital Library
John Criswell. The information flow compiler. https://llvm.org/svn/llvm-project/giri/, 2011.Google Scholar
Jones, J. A., and Harrold, M. J. Empirical evaluation of the tarantula automatic fault-localization technique. In ASE (2005). Google ScholarDigital Library
Kasikci, B., Ball, T., Candea, G., Erickson, J., and Musuvathi, M. Efficient tracing of cold code via bias-free sampling. In USENIX Annual Technical Conf. (2014). Google ScholarDigital Library
Kasikci, B., Pereira, C., Pokam, G., Schubert, B., Musuvathi, M., and Candea, G. Failure sketches: A better way to debug. In Workshop on Hot Topics in Operating Systems (2015). Google ScholarDigital Library
Kasikci, B., Zamfir, C., and Candea, G. Data races vs. data race bugs: Telling the difference with Portend. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2012). Google ScholarDigital Library
Kasikci, B., Zamfir, C., and Candea, G. Race-Mob: Crowdsourced data race detection. In Symp. on Operating Systems Principles (2013). Google ScholarDigital Library
Kendall, M. G. A new measure of rank correlation. Biometrika (1938).Google Scholar
Kleen, A. simple-pt linux driver. https://github.com/andikleen/simple-pt, 2015.Google Scholar
Lattner, C. Macroscopic Data Structure Analysis and Optimization. PhD thesis, University of Illinois at Urbana-Champaign, May 2005. Google ScholarDigital Library
Lattner, C., and Adve, V. LLVM: A compilation framework for lifelong program analysis and transformation. In Intl. Symp. on Code Generation and Optimization (2004). Google ScholarDigital Library
Liblit, B., Naik, M., Zheng, A. X., Aiken, A., and Jordan, M. I. Scalable statistical bug isolation. In Intl. Conf. on Programming Language Design and Implem. (2005). Google ScholarDigital Library
Liblit, B. R. Cooperative Bug Isolation. PhD thesis, University of California, Berkeley, Dec. 2004. Google ScholarDigital Library
Lu, S., Park, S., Seo, E., and Zhou, Y. Learning from mistakes -- A comprehensive study on real world concurrency bug characteristics. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2008). Google ScholarDigital Library
Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. PIN: building customized program analysis tools with dynamic instrumentation. In Intl. Conf. on Programming Language Design and Implem. (2005). Google ScholarDigital Library
Machado, N., Lucia, B., and Rodrigues, L. Concurrency debugging with differential schedule projections. In PLDI (2015). Google ScholarDigital Library
Manevich, R., Sridharan, M., Adams, S., Das, M., and Yang, Z. PSE: explaining program failures via postmortem static analysis. In Symp. on the Foundations of Software Eng. (2004). Google ScholarDigital Library
Marjamãdki, D. Cppcheck. http://cppcheck.sourceforge.net/, 2015.Google Scholar
McConnell, S. Code Complete. Microsoft Press, 2004.Google Scholar
Miller, B. P., Callaghan, M. D., Cargille, J. M., Hollingsworth, J. K., Irvin, R. B., Karavanic, K. L., Kunchithapadam, K., and Newhall, T. The paradyn parallel performance measurement tool. Computer (1995). Google ScholarDigital Library
Montesinos, P., Ceze, L., and Torrellas, J. Delorean: Recording and deterministically replaying shared-memory multiprocessor execution efficiently. ISCA (2008). Google ScholarDigital Library
Musuvathi, M., Qadeer, S., Ball, T., Basler, G., Nainar, P. A., and Neamtiu, I. Finding and reproducing Heisenbugs in concurrent programs. In Symp. on Operating Sys. Design and Implem. (2008). Google ScholarDigital Library
Novark, G., Berger, E. D., and Zorn, B. G. Exterminator: Automatically correcting memory errors with high probability. In Intl. Conf. on Programming Language Design and Implem. (2007). Google ScholarDigital Library
Papamarcos, M. S., and Patel, J. H. A low-overhead coherence solution for multiprocessors with private cache memories. In ISCA (1984). Google ScholarDigital Library
Park, S., Xiong, W., Yin, Z., Kaushik, R., Lee, K. H., Lu, S., and Zhou, Y. PRES: Probabilistic replay with execution sketching on multiprocessors. In Symp. on Operating Systems Principles (2009). Google ScholarDigital Library
Perkins, J. H., Kim, S., Larsen, S., Amarasinghe, S., Bachrach, J., Carbin, M., Pacheco, C., Sherwood, F., Sidiroglou, S., Sullivan, G., Wong, W.-F., Zibin, Y., Ernst, M. D., and Rinard, M. Automatically patching errors in deployed software. In Symp. on Operating Sys. Design and Implem. (2010). Google ScholarDigital Library
Pokam, G., Pereira, C., Hu, S., Adl-Tabatabai, A.-R., Gottschlich, J., Ha, J., and Wu, Y. Coreracer: A practical memory race recorder for multicore x86 tso processors. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture (2011), MICRO-44, ACM, pp. 216--225. Google ScholarDigital Library
Qin, F., Tucek, J., Zhou, Y., and Sundaresan, J. Rx: Treating bugs as allergies -- a safe method to survive software failures. ACM Transactions on Computer Systems 25, 3 (2007). Google ScholarDigital Library
Quora. What is a coder's worst nightmware? http://www.quora.com/What-is-a-coders-worst-nightmare.Google Scholar
Rastislav Bodik, S. A. Path-sensitive value-flow analysis. In Symp. on Principles of Programming Languages (1998). Google ScholarDigital Library
Rijsbergen, C. J. V. Information Retrieval. Butterworth-Heinemann, 1979. Google ScholarDigital Library
Sadowski, C., and Yi, J. How developers use data race detection tools. In Proceedings of the 5th Workshop on Evaluation and Usability of Programming Languages and Tools (2014), PLATEAU. Google ScholarDigital Library
Sahoo, S. K., Criswell, J., Geigle, C., and Adve, V. Using likely invariants for automated software fault localization. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2013). Google ScholarDigital Library
Slaby, J. Llvm slicer. https://github.com/jirislaby/LLVMSlicer/, 2014.Google Scholar
SQLite. http://www.sqlite.org/, 2013.Google Scholar
Stenberg, D. Curl bug 965. http://sourceforge.net/p/curl/bugs/965/, 2013.Google Scholar
Stenberg, D. Curl. http://curl.haxx.se/, 2015.Google Scholar
Stoddard, B. Apache bug 21287. https://bz.apache.org/bugzilla/show_bug.cgi?id=21287, 2003.Google Scholar
Sweeney, L. K-Anonymity: A model for protecting privacy. In Intl. Journal on Uncertainty, Fuzziness and Knowledge-based Systems (2002). Google ScholarDigital Library
The Associated Press. Northeastern blackout bug. http://www.securityfocus.com/news/8032, 2004.Google Scholar
Transmission. http://www.transmissionbt.com/, 2015.Google Scholar
Tucek, J., Lu, S., Huang, C., Xanthos, S., and Zhou, Y. Triage: diagnosing production run failures at the user's site. In Symp. on Operating Systems Principles (2007). Google ScholarDigital Library
Veeraraghavan, K., Lee, D., Wester, B., Ouyang, J., Chen, P. M., Flinn, J., and Narayanasamy, S. Doubleplay: Parallelizing sequential logging and replay. TOCS 30, 1 (2012). Google ScholarDigital Library
Wang, Y., Patil, H., Pereira, C., Lueck, G., Gupta, R., and Neamtiu, I. Drdebug: Deterministic replay based cyclic debugging with dynamic slicing. In CGO (2014). Google ScholarDigital Library
Weidendorfer, J. Kcachegrind. http://kcachegrind.sourceforge.net/html/Home.html, 2015.Google Scholar
Weining Gu, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Zhen-Yu Yang. Characterization of linux kernel behavior under errors, 2003.Google ScholarCross Ref
Weiser, M. Program slicing. In Intl. Conf. on Software Engineering (1981). Google ScholarDigital Library
Wheeler, D. SLOCCount. http://www.dwheeler.com/sloccount/, 2010.Google Scholar
Wilson, P. F., Dell, L. D., and Anderson, G. F. Root Cause Analysis : A Tool for Total Quality Management. American Society for Quality, 1993.Google Scholar
Wu, J., Cui, H., and Yang, J. Bypassing races in live applications with execution filters. In Symp. on Operating Sys. Design and Implem. (2010). Google ScholarDigital Library
Xin, B., Sumner, W. N., and Zhang, X. Efficient program execution indexing. In Intl. Conf. on Programming Language Design and Implem. (2008). Google ScholarDigital Library
Yu, J., and Narayanasamy, S. A case for an interleaving constrained shared-memory multi-processor. In Intl. Symp. on Computer Architecture (2009). Google ScholarDigital Library
Yu, Y., Rodeheffer, T., and Chen, W. Racetrack: Efficient detection of data race conditions via adaptive tracking. In Symp. on Operating Systems Principles (2005). Google ScholarDigital Library
Yuan, D., Mai, H., Xiong, W., Tan, L., Zhou, Y., and Pasupathy, S. SherLog: error diagnosis by connecting clues from run-time logs. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2010). Google ScholarDigital Library
Zamfir, C., Altekar, G., Candea, G., and Stoica, I. Debug determinism: The sweet spot for replay-based debugging. In Workshop on Hot Topics in Operating Systems (2011). Google ScholarDigital Library
Zeller, A., and Hildebrandt, R. Simplifying and isolating failure-inducing input. IEEE Transactions on Software Engineering (2002). Google ScholarDigital Library
Zhang, W., Lim, J., Olichandran, R., Scherpelz, J., Jin, G., Lu, S., and Reps, T. ConSeq: Detecting concurrency bugs through sequential errors. In Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (2011). Google ScholarDigital Library

Index Terms

Failure sketching: a technique for automated root cause diagnosis of in-production failures

Recommendations

The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure
SOSP '19: Proceedings of the 27th ACM Symposium on Operating Systems Principles

The end goal of failure diagnosis is to locate the root cause. Prior root cause localization approaches almost all rely on statistical analysis. This paper proposes taking a different approach based on the observation that if we model an execution as a ...
Read More
Lazy Diagnosis of In-Production Concurrency Bugs
SOSP '17: Proceedings of the 26th Symposium on Operating Systems Principles

Diagnosing concurrency bugs---the process of understanding the root causes of concurrency failures---is hard. Developers depend on reproducing concurrency bugs to diagnose them. Traditionally, systems that attempt to reproduce concurrency bugs record ...
Read More
Pivot tracing: dynamic causal monitoring for distributed systems
SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

Monitoring and troubleshooting distributed systems is notoriously difficult; potential problems are complex, varied, and unpredictable. The monitoring and diagnosis tools commonly used today -- logs, counters, and metrics -- have two important ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles
October 2015
499 pages
ISBN:9781450338349
DOI:10.1145/2815400
General Chair:
Ethan Miller
UC Santa Cruz
,
Program Chair:
Steven Hand
Google
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 October 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
SOSP '15 Paper Acceptance Rate30of181submissions,17%Overall Acceptance Rate131of716submissions,18%
More
Upcoming Conference
SOSP '24

Sponsor:

sigops

ACM SIGOPS 29th Symposium on Operating Systems Principles

November 5 - 8, 2024

Austin , TX , USA
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 62
  Total Citations
  View Citations
- 1,458
  Total Downloads
- Downloads (Last 12 months)104
- Downloads (Last 6 weeks)11
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Failure sketching: a technique for automated root cause diagnosis of in-production failures

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure

Lazy Diagnosis of In-Production Concurrency Bugs

Pivot tracing: dynamic causal monitoring for distributed systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Failure sketching: a technique for automated root cause diagnosis of in-production failures

SOSP '15: Proceedings of the 25th Symposium on Operating Systems Principles

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

The inflection point hypothesis: a principled debugging approach for locating the root cause of a failure

Lazy Diagnosis of In-Production Concurrency Bugs

Pivot tracing: dynamic causal monitoring for distributed systems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Upcoming Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media