Semantic-aware Workflow Construction and Analysis for Distributed Data Analytics Systems

Authors:
Aidi Pi, Wei Chen, Shaoqi Wang, Xiaobo Zhou
University of Colorado, Colorado Springs, Colorado Springs, CO, USA

HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing, June 2019, Pages 255–266. https://doi.org/10.1145/3307681.3325404

Published: 17 June 2019


ABSTRACT

Logging is a universal approach to recording important events in the workflows of distributed systems. Current log analysis tools ignore the semantic knowledge that is key to workflow construction and analysis. In addition, they focus on infrastructure-level distributed systems. Because of fundamental differences in log features, they are ineffective for distributed data analytics systems. This paper proposes IntelLog, a semantic-aware, non-intrusive workflow reconstruction tool for distributed data analytics systems. It builds hierarchical relationships between components and events from the logs generated by the targeted systems, with little or no domain knowledge. Leveraging natural language processing, IntelLog automatically extracts and formats the semantic information in each log message, including system events, identifiers, locality information, and metric values. It builds a graph to represent the hierarchical relationship of components in the targeted system via nomenclature conventions. We implement IntelLog for Hadoop MapReduce, Spark, and Tez. Evaluation results show that IntelLog provides a fine-grained view of the system workflows with semantics. It outperforms existing tools in automatically detecting anomalies caused by real-world problems, misconfigurations, and system bugs. Users can query the formatted semantic knowledge to understand and further troubleshoot the systems.
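To give a rough sense of the kind of formatted record the abstract describes, the sketch below pulls identifiers, locality information, and metric values out of a single log message. It is not IntelLog's method: the paper relies on natural language processing, whereas this example substitutes plain regular expressions. The log line, the `extract` function, the field names, and all three patterns are hypothetical assumptions made only for illustration.

```python
import re

# Hypothetical patterns, chosen for this example only.
# IntelLog itself uses NLP rather than hand-written regexes.
IDENTIFIER = re.compile(r"\b[a-z]+(?:_[a-z0-9]+)*_\d+\b")   # e.g. attempt_01_m_000003
LOCALITY = re.compile(r"\b[\w.-]+:\d{2,5}\b")               # host:port
METRIC = re.compile(r"\b(\d+(?:\.\d+)?)\s*(ms|s|MB|KB|bytes)\b")


def extract(message: str) -> dict:
    """Return a structured record for one log message."""
    return {
        "identifiers": IDENTIFIER.findall(message),
        "localities": LOCALITY.findall(message),
        # Each metric becomes a (value, unit) pair.
        "metrics": [(float(value), unit) for value, unit in METRIC.findall(message)],
    }


# A made-up log line in the style of a data analytics framework.
line = "Task attempt_01_m_000003 on node7.example.org:8041 fetched 128 MB in 950 ms"
record = extract(line)
print(record)
```

Once each message is reduced to such a record, simple queries (for example, all messages mentioning a given identifier) become dictionary lookups rather than text searches.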

Published in
HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
June 2019
278 pages
ISBN:9781450366700
DOI:10.1145/3307681
General Chair: Jon Weissman (University of Minnesota, USA)
Program Chairs: Ali R. Butt (Virginia Tech, USA), Evgenia Smirni (College of William and Mary, USA)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
distributed systems
log profiling
natural language processing
troubleshooting
workflow construction
Qualifiers
- research-article
Acceptance Rates
HPDC '19 Paper Acceptance Rate: 22 of 106 submissions, 21%
Overall Acceptance Rate: 166 of 966 submissions, 17%