Towards highly reliable enterprise network services via inference of multi-level dependencies

Published: 27 August 2007

Abstract

Localizing the sources of performance problems in large enterprise networks is extremely challenging. Dependencies are numerous, complex and inherently multi-level, spanning hardware and software components across the network and the computing infrastructure. To exploit these dependencies for fast, accurate problem localization, we introduce an Inference Graph model, which is well-adapted to user-perceptible problems rooted in conditions giving rise to both partial service degradation and hard faults. Further, we introduce the Sherlock system to discover Inference Graphs in the operational enterprise, infer critical attributes, and then leverage the result to automatically detect and localize problems. To illuminate strengths and limitations of the approach, we provide results from a prototype deployment in a large enterprise network, as well as from testbed emulations and simulations. In particular, we find that taking into account multi-level structure leads to a 30% improvement in fault localization, as compared to two-level approaches.
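To make the contrast between multi-level and two-level dependency models concrete, here is a minimal sketch in Python. It is not the Sherlock system or its Inference Graph: the topology, the node names, and the `closure` and `suspects` helpers are all hypothetical, and the probabilistic inference the abstract alludes to is replaced by a much simpler deterministic elimination that assumes a single hard fault. The sketch only shows why following dependencies past the first level can surface candidate causes that a two-level view never considers.

```python
from typing import Dict, List, Set

# Hypothetical toy topology (not from the paper): each key depends directly
# on the components listed for it. Keys containing "->" stand for
# user-visible client-to-service observations.
DEPS: Dict[str, List[str]] = {
    "clientA->web": ["web_server", "dns"],
    "clientB->web": ["web_server", "dns"],
    "clientC->file": ["file_server", "dns"],
    "web_server": ["sql_backend", "router1"],
    "file_server": ["router2"],
    "dns": ["router2"],
    "sql_backend": ["router1"],
}


def closure(node: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return every component reachable from `node` via dependency edges."""
    seen: Set[str] = set()
    stack = list(deps.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, []))
    return seen


def suspects(failed: List[str], healthy: List[str],
             deps: Dict[str, List[str]], multi_level: bool = True) -> Set[str]:
    """Components shared by all failed observations and by no healthy one.

    Simplifying assumption: exactly one component has failed outright.
    """
    if not failed:
        return set()

    def reach(obs: str) -> Set[str]:
        # Multi-level follows dependencies transitively; two-level stops
        # at the direct dependencies of the observation.
        return closure(obs, deps) if multi_level else set(deps.get(obs, []))

    candidates = set.intersection(*(reach(obs) for obs in failed))
    for obs in healthy:
        candidates -= reach(obs)
    return candidates


if __name__ == "__main__":
    failed = ["clientA->web", "clientB->web"]
    healthy = ["clientC->file"]
    print(sorted(suspects(failed, healthy, DEPS)))
    # ['router1', 'sql_backend', 'web_server']
    print(sorted(suspects(failed, healthy, DEPS, multi_level=False)))
    # ['web_server']
```

In this toy case the two-level view can only blame `web_server`, while the multi-level view also keeps `sql_backend` and `router1` as candidates, since either could be the underlying cause of the web failures.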

    • Published in

      ACM SIGCOMM Computer Communication Review, Volume 37, Issue 4
      October 2007
      420 pages
      ISSN: 0146-4833
      DOI: 10.1145/1282427

      SIGCOMM '07: Proceedings of the 2007 conference on Applications, technologies, architectures, and protocols for computer communications
      August 2007
      432 pages
      ISBN: 9781595937131
      DOI: 10.1145/1282380

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States
