ABSTRACT
Modern engineering incorporates smart technologies in all aspects of our lives. These technologies generate terabytes of log messages every day to report their status. It is crucial to analyze these log messages and present usable information (e.g., patterns) to administrators so that they can manage and monitor these technologies. Patterns minimally represent large groups of log messages and enable administrators to perform further analysis, such as anomaly detection and event prediction. Although patterns commonly exist in automatically generated log messages, recognizing them in massive sets of log messages from heterogeneous sources, without any prior information, is a significant undertaking. We propose a method, named LogMine, that extracts high-quality patterns from a given set of log messages. Our method is fast, memory efficient, accurate, and scalable. LogMine is implemented in the map-reduce framework for distributed platforms and processes millions of log messages in seconds. LogMine is a robust method that works for heterogeneous log messages generated by a wide variety of systems. Our method exploits algorithmic techniques to minimize computational overhead, based on the fact that log messages are always automatically generated. We evaluate the performance of LogMine on massive sets of log messages generated in industrial applications. LogMine generates patterns that are as good as those produced by an exact but unscalable method, while achieving a 500× speedup. Finally, we describe three applications of the patterns generated by LogMine in monitoring large-scale industrial systems.
Index Terms
- LogMine: Fast Pattern Recognition for Log Analytics