research-article

Learning a Hierarchical Monitoring System for Detecting and Diagnosing Service Issues

Authors:
Vinod Nair (Microsoft Research India, Bangalore, India)
Ameya Raul (Microsoft Research India, Bangalore, India)
Shwetabh Khanduja (Microsoft Research India, Bangalore, India)
Vikas Bahirwani (Microsoft, Redmond, WA, USA)
Qihong Shao (Microsoft, Redmond, WA, USA)
Sundararajan Sellamanickam (Microsoft Research India, Bangalore, India)
Sathiya Keerthi (Microsoft, Mountain View, CA, USA)
Steve Herbert (Microsoft, Redmond, WA, USA)
Sudheer Dhulipalla (Microsoft, Redmond, WA, USA)

KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2015, Pages 2029–2038
https://doi.org/10.1145/2783258.2788624

Published: 10 August 2015

ABSTRACT

We propose a machine-learning-based framework for building a hierarchical monitoring system to detect and diagnose service issues. We demonstrate its use by building a monitoring system for a distributed data storage and computing service consisting of tens of thousands of machines. Our solution has been deployed in production as an end-to-end system, spanning telemetry data collection from individual machines through a visualization tool that service operators use to examine the detection outputs. We present evaluation results on detecting 19 customer-impacting issues that occurred in the past three months.

Supplemental Material

p2029.mp4 (MP4, 288.1 MB)

Index Terms

1. Computing methodologies
  1. Machine learning


Published in
KDD '15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2015
2378 pages
ISBN:9781450336642
DOI:10.1145/2783258
General Chairs:
Longbing Cao (University of Technology, Sydney)
Chengqi Zhang (University of Technology, Sydney)
Program Chairs:
Thorsten Joachims (Cornell University)
Geoff Webb (Monash University)
Dragos D. Margineantu (Boeing Research)
Graham Williams (Australian Taxation Office)
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
high-dimensional time series
service monitoring
unsupervised learning
Acceptance Rates
KDD '15 Paper Acceptance Rate: 160 of 819 submissions, 20%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%