DOI: 10.1145/3236024.3236083
research-article

Identifying impactful service system problems via log analysis

Published: 26 October 2018

ABSTRACT

Logs are often used for troubleshooting in large-scale software systems. For a cloud-based online system that provides 24/7 service, a huge number of logs can be generated every day. However, these logs are generally highly imbalanced: most logs indicate normal system operations, and only a small percentage reveal impactful problems. Problems that lead to a decline in system KPIs (Key Performance Indicators) are impactful and should be fixed by engineers with high priority. Furthermore, there are various types of system problems, which are hard to distinguish manually. In this paper, we propose Log3C, a novel clustering-based approach to promptly and precisely identify impactful system problems by utilizing both log sequences (a sequence of log events) and system KPIs. More specifically, we design a novel cascading clustering algorithm, which can greatly reduce clustering time while maintaining high accuracy by iteratively sampling, clustering, and matching log sequences. We then identify the impactful problems by correlating the clusters of log sequences with system KPIs. Log3C is evaluated on real-world log data collected from an online service system at Microsoft, and the results confirm its effectiveness and efficiency. Furthermore, our approach has been successfully applied in industrial practice.
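The iterative sample, cluster, and match loop named in the abstract can be illustrated with a minimal Python sketch. This is not the Log3C implementation: the abstract does not specify how log sequences are vectorized, which clustering method is applied to each sample, or how matching is decided. The Euclidean distance, the greedy "leader" clustering, and the `sample_rate` and `threshold` parameters below are all assumptions chosen only to keep the example short and runnable.

```python
import random
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_sample(sample, threshold):
    """Greedy leader clustering, a hypothetical stand-in for the clustering step.

    A vector becomes a new cluster representative only if it is farther than
    `threshold` from every representative found so far.
    """
    reps = []
    for v in sample:
        if all(distance(v, r) > threshold for r in reps):
            reps.append(v)
    return reps


def cascading_clustering(vectors, sample_rate=0.1, threshold=0.5, seed=0):
    """Repeat sampling, clustering, and matching until every vector is assigned.

    Returns (patterns, labels), where labels[i] is an index into patterns.
    """
    rng = random.Random(seed)
    patterns = []
    labels = [None] * len(vectors)
    remaining = list(range(len(vectors)))
    while remaining:
        # 1. Sampling: draw a small random subset of the unassigned vectors.
        k = max(1, int(len(remaining) * sample_rate))
        sample_idx = rng.sample(remaining, k)
        # 2. Clustering: cluster only the sample to obtain new patterns.
        new_reps = cluster_sample([vectors[i] for i in sample_idx], threshold)
        offset = len(patterns)
        patterns.extend(new_reps)
        # 3. Matching: assign every unassigned vector that is close enough
        #    to one of the new patterns; the rest go to the next round.
        unmatched = []
        for i in remaining:
            best = min(range(offset, len(patterns)),
                       key=lambda p: distance(vectors[i], patterns[p]))
            if distance(vectors[i], patterns[best]) <= threshold:
                labels[i] = best
            else:
                unmatched.append(i)
        remaining = unmatched
    return patterns, labels


if __name__ == "__main__":
    # Two dense groups (standing in for frequent, normal behavior) and one
    # rare outlier (standing in for an unusual log sequence).
    data = ([(0.01 * i, 0.0) for i in range(20)]
            + [(10 + 0.01 * i, 10.0) for i in range(20)]
            + [(100.0, 100.0)])
    found, assignment = cascading_clustering(data, threshold=1.0)
    print(len(found), "patterns found")
```

Under these assumptions, large groups tend to be absorbed in the early rounds because they dominate each random sample, so later rounds operate on a much smaller remainder in which rare sequences are more likely to be sampled. Each round assigns at least the sampled vectors, so the loop always terminates.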


      • Published in

        ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering
        October 2018
        987 pages
        ISBN: 9781450355735
        DOI: 10.1145/3236024

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States



        Acceptance Rates

        Overall Acceptance Rate: 112 of 543 submissions, 21%
