Capturing, indexing, clustering, and retrieving system history
Source: ACM Symposium on Operating Systems Principles
Proceedings of the twentieth ACM symposium on Operating systems principles
Brighton, United Kingdom
Session: History and context
Pages: 105-118
Year of Publication: 2005
ISBN: 1-59593-079-5
Authors
Ira Cohen, Hewlett-Packard Laboratories, Palo Alto, CA
Steve Zhang, Stanford University, Palo Alto, CA
Moises Goldszmidt, Hewlett-Packard Laboratories, Palo Alto, CA
Julie Symons, Hewlett-Packard Laboratories, Palo Alto, CA
Terence Kelly, Hewlett-Packard Laboratories, Palo Alto, CA
Armando Fox, Hewlett-Packard Laboratories, Palo Alto, CA
Sponsors
ACM: Association for Computing Machinery
SIGOPS: ACM Special Interest Group on Operating Systems
Publisher
ACM, New York, NY, USA
DOI: http://doi.acm.org/10.1145/1095810.1095821

ABSTRACT

We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristics of a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures, which simply records the actual "raw" values of collected measurements, is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) and a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24x7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.
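As a rough illustration of the similarity-based retrieval described in the abstract, the sketch below ranks previously stored signatures by their distance to a newly observed one. The abstract does not specify the signature format or the distance measure, so everything here is an assumption: signatures are treated as fixed-length numeric vectors, distance is plain Euclidean (L2), and the names `distance`, `retrieve_similar`, and the `history` entries are hypothetical. It is not the authors' implementation.

```python
import math


def distance(a, b):
    """Euclidean (L2) distance between two equal-length signature vectors.

    Assumption: the abstract does not name a distance measure; L2 is used
    here only for illustration.
    """
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve_similar(query, history, k=3):
    """Return the k stored (label, signature) pairs closest to the query."""
    ranked = sorted(history, key=lambda item: distance(query, item[1]))
    return ranked[:k]


# Hypothetical history of past system states, one signature per time epoch.
# The vector encoding is made up for this example.
history = [
    ("epoch-1", [1, 1, 0, -1]),
    ("epoch-2", [1, 1, 0, 0]),
    ("epoch-3", [-1, 0, 1, 1]),
    ("epoch-4", [0, -1, 1, 1]),
]

# Signature of a newly observed state; retrieve its two nearest neighbours.
query = [1, 1, -1, -1]
nearest = retrieve_similar(query, history, k=2)
for label, signature in nearest:
    print(label, signature, round(distance(query, signature), 3))
```

An operator could then look up whatever diagnosis notes are attached to the retrieved epochs, which is the kind of reuse of earlier diagnostic effort the abstract describes.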

