DOI: 10.1145/1629575.1629587
research-article

Detecting large-scale system problems by mining console logs

Published: 11 October 2009

ABSTRACT

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software's internals.
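
The feature-creation step described in the abstract can be illustrated with a minimal sketch. It assumes that parsing yields a message type and an identifier for each log line, and that a feature is a per-identifier count of message types. The regular-expression templates, the `blk_` identifier format, and the sample log lines are all hypothetical and written by hand for illustration; the abstract states that the actual method derives message structure from source code analysis and information retrieval, and it does not specify the feature layout.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates. In the described method these would be
# derived from source code analysis; here they are hand-written assumptions.
TEMPLATES = {
    "receiving": re.compile(r"Receiving block (?P<id>blk_\d+)"),
    "received": re.compile(r"Received block (?P<id>blk_\d+)"),
    "deleting": re.compile(r"Deleting block (?P<id>blk_\d+)"),
}


def parse_line(line):
    """Return (message_type, identifier), or None if no template matches."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            return name, match.group("id")
    return None


def count_vectors(lines):
    """Group parsed messages by identifier and count each message type."""
    counts = defaultdict(Counter)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            name, ident = parsed
            counts[ident][name] += 1
    order = list(TEMPLATES)  # fixed column order for every vector
    return {ident: [c[t] for t in order] for ident, c in counts.items()}


# Made-up sample lines, not taken from any real log.
sample_log = [
    "INFO Receiving block blk_1 src: /10.0.0.1",
    "INFO Received block blk_1 of size 64",
    "INFO Receiving block blk_2 src: /10.0.0.2",
    "INFO Deleting block blk_1",
    "WARN unrelated message",
]
vectors = count_vectors(sample_log)
```

Each resulting vector has one column per template, in the order the templates are declared, so the vectors can be stacked into a matrix and passed to whichever learning method is chosen for the detection step.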

Published in

SOSP '09: Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles
October 2009
346 pages
ISBN: 9781605587523
DOI: 10.1145/1629575

Copyright © 2009 ACM

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 131 of 716 submissions, 18%
