Article

Capturing, indexing, clustering, and retrieving system history

Authors:

Moises Goldszmidt,

Armando Fox

SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles

Pages 105 - 118

https://doi.org/10.1145/1095810.1095821

Published: 20 October 2005

Abstract

We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristics of a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures, which simply records the actual "raw" values of collected measurements, is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) and a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24x7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.
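To make the idea of similarity-based retrieval concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each past system state has already been reduced to a fixed-length numeric signature; the abstract does not give the construction, so the vectors, the incident labels, the Euclidean distance, and the function names `l2_distance` and `retrieve_similar` are all hypothetical choices made for illustration.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length signature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve_similar(query, history, k=2):
    """Return the k (label, distance) pairs in history closest to query."""
    ranked = sorted(
        ((label, l2_distance(query, sig)) for label, sig in history.items()),
        key=lambda pair: pair[1],
    )
    return ranked[:k]

# Hypothetical signatures of previously observed states (illustrative values).
history = {
    "incident-A": [1, 1, 0, -1],
    "incident-B": [1, 1, 0, 0],
    "incident-C": [-1, 0, 1, 1],
}

# Signature of the state currently being observed (also illustrative).
query = [1, 1, 0, -1]
top = retrieve_similar(query, history, k=2)
for label, dist in top:
    print(label, round(dist, 3))
```

In this toy setup the closest stored signature is the one an operator would inspect first to see whether an earlier diagnosis applies; a real deployment would need a distance measure and a signature construction suited to its own metrics.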

Index Terms

Capturing, indexing, clustering, and retrieving system history
1. Computing methodologies
  1. Modeling and simulation
    1. Model development and analysis
      1. Modeling methodologies

Published In

SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles

October 2005

259 pages

ISBN:1595930795

DOI:10.1145/1095810

General Chair: Andrew Herbert, Microsoft Research, UK
Program Chair: Ken Birman, Cornell University, USA

ACM SIGOPS Operating Systems Review Volume 39, Issue 5
SOSP '05
December 2005
290 pages
ISSN:0163-5980
DOI:10.1145/1095809

Copyright © 2005 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

SOSP05: ACM SIGOPS 20th Symposium on Operating Systems Principles 2005

October 23 - 26, 2005

Brighton, United Kingdom

Acceptance Rates

Overall Acceptance Rate 174 of 961 submissions, 18%
