ABSTRACT
Logs are often used for troubleshooting in large-scale software systems. A cloud-based online system that provides 24/7 service can generate a huge number of logs every day. However, these logs are generally highly imbalanced: most indicate normal system operations, and only a small percentage reveal impactful problems. Problems that lead to a decline in system KPIs (Key Performance Indicators) are impactful and should be fixed by engineers with high priority. Furthermore, there are various types of system problems, which are hard to distinguish manually. In this paper, we propose Log3C, a novel clustering-based approach to promptly and precisely identify impactful system problems by utilizing both log sequences (sequences of log events) and system KPIs. More specifically, we design a novel cascading clustering algorithm, which greatly reduces clustering time while maintaining high accuracy by iteratively sampling, clustering, and matching log sequences. We then identify the impactful problems by correlating the clusters of log sequences with system KPIs. Log3C is evaluated on real-world log data collected from an online service system at Microsoft, and the results confirm its effectiveness and efficiency. Furthermore, our approach has been successfully applied in industrial practice.
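The sample, cluster, and match loop named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the vector representation, the distance measure, or the clustering method, so the Euclidean distance, the greedy threshold clustering, and the names `cascading_clustering`, `cluster_sample`, `sample_rate`, and `threshold` are all assumptions made for illustration.

```python
import random


def distance(a, b):
    """Euclidean distance between two equal-length numeric vectors (assumed metric)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def cluster_sample(sample, threshold):
    """Greedy stand-in for the clustering step (hypothetical helper).

    A vector becomes a new pattern unless an existing pattern is already
    within `threshold` of it. Returns one representative per cluster.
    """
    patterns = []
    for vec in sample:
        if not any(distance(vec, p) <= threshold for p in patterns):
            patterns.append(vec)
    return patterns


def cascading_clustering(vectors, sample_rate=0.1, threshold=0.5, seed=0):
    """Iteratively sample, cluster, and match until every vector is assigned.

    Returns (patterns, labels), where labels[i] is an index into patterns.
    """
    rng = random.Random(seed)
    patterns = []
    labels = [None] * len(vectors)
    remaining = list(range(len(vectors)))
    while remaining:
        # 1. Sampling: draw a small random subset of the unassigned vectors.
        k = max(1, int(len(remaining) * sample_rate))
        sample_ids = rng.sample(remaining, k)
        # 2. Clustering: cluster only the sample and keep its patterns.
        new_patterns = cluster_sample([vectors[i] for i in sample_ids], threshold)
        offset = len(patterns)
        patterns.extend(new_patterns)
        # 3. Matching: assign each unassigned vector to its nearest new pattern
        #    if it is close enough; the rest wait for the next iteration.
        unmatched = []
        for i in remaining:
            dists = [distance(vectors[i], p) for p in new_patterns]
            best = min(range(len(dists)), key=dists.__getitem__)
            if dists[best] <= threshold:
                labels[i] = offset + best
            else:
                unmatched.append(i)
        remaining = unmatched
    return patterns, labels


if __name__ == "__main__":
    # Toy data: two well-separated groups standing in for vectorized log sequences.
    toy = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
    found_patterns, found_labels = cascading_clustering(toy)
    print(len(found_patterns), found_labels)
```

Because only the sample is clustered in each round and the remaining vectors are merely compared against the resulting patterns, the expensive clustering step always runs on a small subset. Frequent, normal sequences tend to be matched in early rounds, so later rounds concentrate on the rarer sequences.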
Index Terms
- Identifying impactful service system problems via log analysis