research-article
Open access

Debugging Distributed Systems: Challenges and options for validation and debugging

Published: 28 March 2016

Abstract

Distributed systems pose unique challenges for software developers. Reasoning about concurrent activities of system nodes and even understanding the system’s communication topology can be difficult. A standard approach to gaining insight into system activity is to analyze system logs. Unfortunately, this can be a tedious and complex process. This article looks at several key features and debugging challenges that differentiate distributed systems from other kinds of software. The article presents several promising tools and ongoing research to help resolve these challenges.

Published In

Queue  Volume 14, Issue 2
Cloud Debugging
March-April 2016
131 pages
ISSN:1542-7730
EISSN:1542-7749
DOI:10.1145/2927299
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article
  • Popular
  • Refereed
