ABSTRACT
Logging is a universal approach to recording important events in the workflows of distributed systems. Current log analysis tools ignore the semantic knowledge that is key to workflow construction and analysis. They also focus on infrastructure-level distributed systems and, because log features differ fundamentally, are ineffective for distributed data analytics systems. This paper proposes IntelLog, a semantic-aware, non-intrusive workflow reconstruction tool for distributed data analytics systems. It builds hierarchical relationships between components and events from the logs generated by the targeted systems with little or no domain knowledge. Leveraging natural language processing, IntelLog automatically extracts and formats the semantic information in each log message, including system events, identifiers, locality information, and metric values. It builds a graph that represents the hierarchical relationships among components in the targeted system based on nomenclature conventions. We implement IntelLog for Hadoop MapReduce, Spark, and Tez. Evaluation results show that IntelLog provides a fine-grained view of system workflows with semantics. It outperforms existing tools in automatically detecting anomalies caused by real-world problems, misconfigurations, and system bugs. Users can query the formatted semantic knowledge to understand and further troubleshoot the systems.
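To make the kind of information named in the abstract concrete, the following is a minimal sketch, not IntelLog's implementation. It swaps the natural language processing step for plain regular expressions, and it assumes that "nomenclature conventions" can be approximated by Hadoop-style identifiers in which a child's identifier extends its parent's. The log line, the patterns, and the function names `extract` and `build_hierarchy` are all hypothetical.

```python
import re

# Hypothetical patterns. The tool described above relies on NLP;
# fixed regexes are used here only to keep the sketch short.
ID_RE = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)+\b")              # e.g. task_1_0002_m_000003
LOCALITY_RE = re.compile(r"\b([A-Za-z][\w.-]*):(\d{2,5})\b")  # host:port
METRIC_RE = re.compile(r"\b(\d+(?:\.\d+)?)\s*(bytes|KB|MB|GB|ms)\b")


def extract(message):
    """Return identifiers, locality, and metric values found in one log message."""
    return {
        "identifiers": ID_RE.findall(message),
        "locality": [(host, int(port)) for host, port in LOCALITY_RE.findall(message)],
        "metrics": [(float(value), unit) for value, unit in METRIC_RE.findall(message)],
    }


def _suffix(identifier):
    # Drop the leading kind word ("job", "task", "attempt") and keep the rest.
    return identifier.split("_", 1)[1]


def build_hierarchy(identifiers):
    """Map each identifier to its closest parent, or None if it has none.

    Assumption: B is an ancestor of A when B's suffix, followed by "_",
    is a prefix of A's suffix. The longest such B is taken as the parent.
    """
    parents = {}
    for child in identifiers:
        best = None
        for cand in identifiers:
            if cand == child:
                continue
            if _suffix(child).startswith(_suffix(cand) + "_"):
                if best is None or len(_suffix(cand)) > len(_suffix(best)):
                    best = cand
        parents[child] = best
    return parents


if __name__ == "__main__":
    # Made-up log line in the style of a MapReduce shuffle message.
    line = "attempt_1_0002_m_000003_0 fetched 2048 bytes from node07:13562 in 35 ms"
    print(extract(line))
    print(build_hierarchy(
        ["job_1_0002", "task_1_0002_m_000003", "attempt_1_0002_m_000003_0"]
    ))
```

Under these assumptions, the attempt is linked to its task and the task to its job, which is one simple way to obtain a component hierarchy from logs without instrumenting the system.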
Index Terms
- Semantic-aware Workflow Construction and Analysis for Distributed Data Analytics Systems