ABSTRACT
We propose a machine-learning-based framework for building a hierarchical monitoring system that detects and diagnoses service issues. We demonstrate its use by building a monitoring system for a distributed data storage and computing service consisting of tens of thousands of machines. Our solution has been deployed in production as an end-to-end system, spanning everything from telemetry data collection on individual machines to a visualization tool that service operators use to examine the detection outputs. We present evaluation results on detecting 19 customer-impacting issues that occurred in the past three months.
Index Terms
- Learning a Hierarchical Monitoring System for Detecting and Diagnosing Service Issues