Abstract
Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inference Graph model, which is well-adapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults. Further, we introduce the Sherlock system to discover Inference Graphs in the operational enterprise, infer critical attributes, and then leverage the result to automatically detect and localize problems. To illuminate strengths and limitations of the approach, we provide results from a prototype deployment in a large enterprise network, as well as from testbed emulations and simulations. In particular, we find that taking into account multi-level structure leads to a 30% improvement in fault localization, as compared to two-level approaches.
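To make the idea of multi-level dependencies concrete, here is a minimal sketch of fault localization over a small three-level dependency graph. It is not Sherlock's algorithm: the node names, the edge probabilities, the noisy-OR combination rule, the single-fault hypotheses, and the assumption that parents fail independently are all illustrative choices that do not come from the abstract.

```python
from math import prod

# Hypothetical dependency graph: node -> {parent: probability a parent fault propagates}.
# Root causes (servers) feed an intermediate node (dns_lookup), which feeds observations.
PARENTS = {
    "dns_lookup": {"dns_server": 0.9},
    "web_access": {"dns_lookup": 1.0, "web_server": 0.9},
    "file_access": {"dns_lookup": 1.0, "file_server": 0.9},
}
ROOT_CAUSES = ["dns_server", "web_server", "file_server"]


def p_down(node, failed_root):
    """Probability that `node` is down if only `failed_root` has failed (noisy-OR)."""
    if node not in PARENTS:  # root-cause node: down only if it is the hypothesized fault
        return 1.0 if node == failed_root else 0.0
    # Assumption: parents affect the node independently.
    p_up = prod(1.0 - w * p_down(parent, failed_root)
                for parent, w in PARENTS[node].items())
    return 1.0 - p_up


def score(failed_root, observations):
    """Likelihood of the observed up/down states under a single-fault hypothesis."""
    return prod(
        p_down(node, failed_root) if is_down else 1.0 - p_down(node, failed_root)
        for node, is_down in observations.items()
    )


def rank_root_causes(observations):
    """Order candidate root causes from most to least consistent with the observations."""
    return sorted(ROOT_CAUSES, key=lambda r: score(r, observations), reverse=True)


# Both user-visible services are failing; the shared DNS dependency explains both.
observations = {"web_access": True, "file_access": True}
ranking = rank_root_causes(observations)
print(ranking[0])
```

In this toy graph the shared intermediate node is what lets one root cause explain two failing services at once. A two-level model that linked each observation directly to its own server would have no such shared path.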
Towards highly reliable enterprise network services via inference of multi-level dependencies
SIGCOMM '07: Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications