DOI: 10.1145/1095810.1095821
Article

Capturing, indexing, clustering, and retrieving system history

Published: 20 October 2005

Abstract

We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures based on simply recording the actual "raw" values of collected measurements is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) as well as a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24x7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.
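The similarity-based retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each signature is a fixed-length numeric vector and that Euclidean distance serves as the similarity measure; both are assumptions made for illustration, since the abstract does not specify either. The names `l2_distance`, `retrieve_similar`, and the `epoch-*` labels are hypothetical, and the sketch does not attempt the statistical modeling that the abstract says is needed to build effective signatures.

```python
import math
from typing import Dict, List, Sequence, Tuple

# Assumption: a signature is a fixed-length vector of numbers.
Signature = Sequence[float]


def l2_distance(a: Signature, b: Signature) -> float:
    """Euclidean distance between two signatures of equal length."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_similar(
    query: Signature, history: Dict[str, Signature], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k stored signatures closest to the query, nearest first."""
    ranked = sorted(
        ((label, l2_distance(query, sig)) for label, sig in history.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]


# Hypothetical history of previously observed system states.
history = {
    "epoch-001": [1.0, 0.0, -1.0, 0.0],
    "epoch-002": [1.0, 0.0, -1.0, 1.0],
    "epoch-003": [-1.0, 1.0, 0.0, 0.0],
}
query = [1.0, 0.0, -1.0, 0.0]
nearest = retrieve_similar(query, history, k=2)
```

Here `nearest` ranks `epoch-001` first (an exact match) and `epoch-002` second. An operator could then look up whatever diagnosis was recorded for those earlier states.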




    Published In

    SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles
    October 2005
    259 pages
    ISBN:1595930795
    DOI:10.1145/1095810
    • ACM SIGOPS Operating Systems Review, Volume 39, Issue 5
      SOSP '05
      December 2005
      290 pages
      ISSN:0163-5980
      DOI:10.1145/1095809
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. bayesian networks
    2. clustering
    3. information retrieval
    4. performance objectives
    5. signatures

    Qualifiers

    • Article

    Conference

    SOSP05

    Acceptance Rates

    Overall Acceptance Rate 174 of 961 submissions, 18%



    Cited By

    • (2024) Reducing the Length of Field-Replay Based Load Testing. IEEE Transactions on Software Engineering 50:8 (1967-1983). DOI: 10.1109/TSE.2024.3408079. Online publication date: Aug-2024
    • (2023) IoPV: On Inconsistent Option Performance Variations. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (845-857). DOI: 10.1145/3611643.3616319. Online publication date: 30-Nov-2023
    • (2023) YISHAN: Managing Large-scale Cloud Database Instances via Machine Learning. IEEE Transactions on Services Computing 16:1 (724-738). DOI: 10.1109/TSC.2021.3131249. Online publication date: 1-Jan-2023
    • (2023) CP Decomposition and Set Theory based Root Cause Analysis in Online Service Systems. 2023 30th Asia-Pacific Software Engineering Conference (APSEC) (221-228). DOI: 10.1109/APSEC60848.2023.00032. Online publication date: 4-Dec-2023
    • (2021) A Survey of AIOps Methods for Failure Management. ACM Transactions on Intelligent Systems and Technology 12:6 (1-45). DOI: 10.1145/3483424. Online publication date: 30-Nov-2021
    • (2021) Predicting Performance Anomalies in Software Systems at Run-time. ACM Transactions on Software Engineering and Methodology 30:3 (1-33). DOI: 10.1145/3440757. Online publication date: 23-Apr-2021
    • (2021) Enhanced anomaly scores for isolation forests. Pattern Recognition 120:C. DOI: 10.1016/j.patcog.2021.108115. Online publication date: 1-Dec-2021
    • (2021) Accuracy vs. complexity. Pattern Recognition 120:C. DOI: 10.1016/j.patcog.2021.108106. Online publication date: 1-Dec-2021
    • (2020) Gandalf. Proceedings of the 17th Usenix Conference on Networked Systems Design and Implementation (389-402). DOI: 10.5555/3388242.3388271. Online publication date: 25-Feb-2020
    • (2020) Performance regression detection in DevOps. Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Companion Proceedings (206-209). DOI: 10.1145/3377812.3381386. Online publication date: 27-Jun-2020
