ABSTRACT
We present a method for automatically extracting from a running system an indexable signature that distills the essential characteristic from a system state and that can be subjected to automated clustering and similarity-based retrieval to identify when an observed system state is similar to a previously-observed state. This allows operators to identify and quantify the frequency of recurrent problems, to leverage previous diagnostic efforts, and to establish whether problems seen at different installations of the same site are similar or distinct. We show that the naive approach to constructing these signatures, based on simply recording the actual "raw" values of collected measurements, is ineffective, leading us to a more sophisticated approach based on statistical modeling and inference. Our method requires only that the system's metric of merit (such as average transaction response time) as well as a collection of lower-level operational metrics be collected, as is done by existing commercial monitoring tools. Even if the traces have no annotations of prior diagnoses of observed incidents (as is typical), our technique successfully clusters system states corresponding to similar problems, allowing diagnosticians to identify recurring problems and to characterize the "syndrome" of a group of problems. We validate our approach on both synthetic traces and several weeks of production traces from a customer-facing geoplexed 24 x 7 system; in the latter case, our approach identified a recurring problem that had required extensive manual diagnosis, and also aided the operators in correcting a previous misdiagnosis of a different problem.