skip to main content
10.1145/2723372.2723711acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Open access

Lineage-driven Fault Injection

Published: 27 May 2015 Publication History

Abstract

In large-scale data management systems, failure is practically a certainty. Fault-tolerant protocols and components are notoriously difficult to implement and debug. Worse still, choosing existing fault-tolerance mechanisms and integrating them correctly into complex systems remains an art form, and programmers have few tools to assist them.
We propose a novel approach for discovering bugs in fault-tolerant data management systems: lineage-driven fault injection. A lineage-driven fault injector reasons backwards from correct system outcomes to determine whether failures in the execution could have prevented the outcome. We present MOLLY, a prototype of lineage-driven fault injection that exploits a novel combination of data lineage techniques from the database literature and state-of-the-art satisfiability testing. If fault-tolerance bugs exist for a particular configuration, MOLLY finds them rapidly, in many cases using a order of magnitude fewer executions than random fault injection. Otherwise, MOLLY certifies that the code is bug-free for that configuration.

References

[1]
The Netflix Simian Army. http://techblog.netflix.com/2011/07/netflix-simian-army.html, 2011.
[2]
Kafka 0.8.0 Documentation. https://kafka.apache.org/08/documentation.html, 2013.
[3]
S. Abiteboul, E. Antoine, and J. Stoyanovich. The Webdamlog System Managing Distributed Knowledge on the Web. CoRR, abs/1304.4187, 2013.
[4]
P. A. Alsberg and J. D. Day. A Principle for Resilient Sharing of Distributed Resources. ICSE '76.
[5]
P. Alvaro, N. Conway, J. M. Hellerstein, and W. R. Marczak. Consistency Analysis in Bloom: a CALM and Collected Approach. CIDR'12.
[6]
P. Alvaro, A. Hutchinson, N. Conway, W. R. Marczak, and J. M. Hellerstein. BloomUnit: Declarative Testing for Distributed Programs. DBTest '12.
[7]
P. Alvaro, W. R. Marczak, N. Conway, J. M. Hellerstein, D. Maier, and R. Sears. Dedalus: Datalog in Time and Space. Datalog'10.
[8]
T. J. Ameloot, F. Neven, and J. Van den Bussche. Relational Transducers for Declarative Networking. PODS'12.
[9]
J. Armstrong. Programming Erlang: Software for a Concurrent World. 2007.
[10]
P. Bailis, A. Davidson, A. Fekete, A. Ghodsi, J. M. Hellerstein, and I. Stoica. Highly Available Transactions: virtues and limitations. VLDB'14.
[11]
P. Bailis and K. Kingsbury. The Network is Reliable. Commun. ACM, 2014.
[12]
J. Baker, C. Bond, J. C. Corbett, J. Furman, A. Khorlin, J. Larson, J.-M. Leon, Y. Li, A. Lloyd, and V. Yushprakh. Megastore: Providing Scalable, Highly Available Storage for Interactive Services. CIDR'11.
[13]
T. Ball, V. Levin, and S. K. Rajamani. A Decade of Software Model Checking with SLAM. Commun. ACM, 2011.
[14]
M. Ben-Or. Another Advantage of Free Choice (Extended Abstract): Completely Asynchronous Agreement Protocols. PODC '83.
[15]
P. A. Bernstein, V. Hadzilacos, and N. Goodman. Concurrency Control and Recovery in Database Systems. Addison-Wesley, 1987.
[16]
H. Blodget. Amazon's Cloud Crash Disaster Permanently Destroyed Many Customers? Data. http://www.businessinsider.com/amazon-lost-data-2011-4, April 2011.
[17]
P. Buneman, S. Khanna, and W.-c. Tan. Why and Where: A Characterization of Data Provenance. ICDT'01.
[18]
C. Cadar, D. Dunbar, and D. Engler. KLEE: Unassisted and Automatic Generation of High-coverage Tests for Complex Systems Programs. OSDI'08.
[19]
T. D. Chandra, V. Hadzilacos, and S. Toueg. The Weakest Failure Detector for Solving Consensus. J. ACM, July 1996.
[20]
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: a Distributed Storage System for Structured Data. OSDI'06.
[21]
J. Cheney, L. Chiticariu, and W.-C. Tan. Provenance in Databases: Why, How, and Where. Found. Trends databases, April 2009.
[22]
K. Claessen and J. Hughes. QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. ICFP '00.
[23]
E. Clarke, O. Grumberg, S. Jha, Y. Lu, and H. Veith. Counterexample-guided Abstraction Refinement for Symbolic Model Checking. J. ACM, 2003.
[24]
J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's Globally-distributed Database. OSDI'12.
[25]
Y. Cui, J. Widom, and J. L. Wiener. Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., June 2000.
[26]
S. Dawson, F. Jahanian, and T. Mitton. ORCHESTRA: A Fault Injection Environment for Distributed Systems. Technical report, FTCS, 1996.
[27]
J. Dean. Designs, Lessons and Advice from Building Large Distributed Systems. http://www.cs.cornell.edu/projects/ladis2009/talks/deankeynoteladis2009.pdf, 2009. Ladis'09 Keynote.
[28]
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. SOSP'07.
[29]
M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility o Distributed Consensus with One Faulty Process. J. ACM, April 1985.
[30]
D. Fisman, O. Kupferman, and Y. Lustig. On verifying fault tolerance of distributed protocols. In Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of LNCS. Springer Berlin Heidelberg, 2008.
[31]
H. Garcia-Molina. Elections in a distributed computing system. IEEE Trans. Comput., January 1982.
[32]
P. Gill, N. Jain, and N. Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. SIGCOMM '11.
[33]
J. Gray. Notes on data base operating systems. In Operating Systems, An Advanced Course, 1978.
[34]
J. Gray. Why do computers stop and what can be done about it?, 1985.
[35]
T. J. Green, G. Karvounarakis, and V. Tannen. Provenance semirings. PODS '07.
[36]
H. S. Gunawi, T. Do, P. Joshi, P. Alvaro, J. M. Hellerstein, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, K. Sen, and D. Borthakur. FATE and DESTINI: A Framework for Cloud Recovery Testing. NSDI'11.
[37]
S. Han and S. Ratnasamy. Large-scale Computation Not at the Cost of Expressiveness. HotOS'13.
[38]
M. Herschel, M. A. Hernández, and W.-C. Tan. Artemis: A System for Analyzing Missing Answers.
[39]
G. Holzmann. The SPIN Model Checker: Primer and Reference Manual. Addison-Wesley Professional, 2003.
[40]
J. Huang, T. Chen, A. Doan, and J. F. Naughton. On the Provenance of Non-answers to Queries over Extracted Data. VLDB'08.
[41]
P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free Coordination for Internet-scale Systems. USENIX ATC'10.
[42]
M. Interlandi, L. Tanca, and S. Bergamaschi. Datalog in time and space, synchronously. CEUR'13.
[43]
F. P. Junqueira, B. C. Reed, and M. Serafini. Zab: High-performance broadcast for primary-backup systems. DSN '11.
[44]
G. A. Kanawati, N. A. Kanawati, and J. A. Abraham. Ferrari: A flexible software-based fault and error injection system. IEEE Trans. Comput., Feb 1995.
[45]
G. Karvounarakis, Z. G. Ives, and V. Tannen. Querying data provenance. SIGMOD '10.
[46]
N. P. Katta, J. Rexford, and D. Walker. Logic programming for software-defined networks. XLDI'12.
[47]
G. Kiczales, J. Lamping, A. Mendhekar, C. Maeda, C. Lopes, J. marc Loingtier, and J. Irwin. Aspect-oriented Programming. ECOOP'97.
[48]
C. E. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, Death, and the Critical Transition: Finding Liveness Bugs in Systems Code. NSDI'07.
[49]
K. Kingsbury. Call me maybe: Kafka. http://aphyr.com/posts/293-call-me-maybe-kafka, 2013.
[50]
S. Köhler, B. Ludäscher, and Y. Smaragdakis. Declarative datalog debugging for mere mortals. In Datalog in Academia and Industry, LNCS. Springer Berlin Heidelberg, 2012.
[51]
S. Köhler, B. Ludäscher, and D. Zinn. First-Order Provenance Games. In In Search of Elegance in the Theory and Practice of Computation, volume 8000 of LNCS. Springer, 2013.
[52]
L. Kuper and R. R. Newton. LVars: Lattice-based Data Structures for Deterministic Parallelism. FHPC'13.
[53]
L. Lamport. Time, Clocks, and the Ordering of Events in a Distributed System. Commun. ACM, Jul 1978.
[54]
L. Lamport. The Part-time Parliament. ACM Transactions on Computer Systems, May 1998.
[55]
W. Lloyd, M. J. Freedman, M. Kaminsky, and D. G. Andersen. Don'T Settle for Eventual: Scalable Causal Consistency for Wide-area Storage with COPS. SOSP '11.
[56]
B. T. Loo, T. Condie, J. M. Hellerstein, P. Maniatis, T. Roscoe, and I. Stoica. Implementing Declarative Overlays. SOSP '05.
[57]
O. Malik. When the Cloud Fails: T-Mobile, Microsoft Lose Sidekick Customer Data. http://gigaom.com/2009/10/10/when-cloud-fails-t-mobile-microsoft-losesidekick-customer-data/, Oct 2009.
[58]
W. R. Marczak, P. Alvaro, N. Conway, J. M. Hellerstein, and D. Maier. Confluence Analysis for Distributed Programs: A Model-Theoretic Approach. Datalog'12.
[59]
P. D. Marinescu and G. Candea. LFI: A practical and general library-level fault injector. In DSN. IEEE, 2009.
[60]
A. Meliou and D. Suciu. Tiresias: The Database Oracle for How-to Queries. SIGMOD '12.
[61]
S. Mullender, editor. Distributed Systems. Addison-Wesley, second edition, 1993.
[62]
K.-K. Muniswamy-Reddy, D. A. Holland, U. Braun, and M. Seltzer. Provenance-aware storage systems. ATEC '06.
[63]
M. Musuvathi, D. Y. W. Park, A. Chou, D. R. Engler, and D. L. Dill. CMC: A Pragmatic Approach to Model Checking Real Code. SIGOPS Oper. Syst. Rev., 2002.
[64]
M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. A. Nainar, and I. Neamtiu. Finding and Reproducing Heisenbugs in Concurrent Programs. OSDI'08.
[65]
T. Nelson, M. Scheer, A. Ferguson, and S. Krishnamurthi. Tierless Programming and Reasoning for Software-Defined Networks. NSDI'14.
[66]
R. Perera, U. A. Acar, J. Cheney, and P. B. Levy. Functional Programs That Explain Their Work. ICFP '12.
[67]
K. Petersen, M. J. Spreitzer, D. B. Terry, M. M. Theimer, and A. J. Demers. Flexible Update Propagation for Weakly Consistent Replication. SIGOPS Oper. Syst. Rev., Dec 1997.
[68]
N. Raychaudhuri. Scala in Action. Manning Publications Co., 2013.
[69]
S. Riddle, S. Köhler, and B. Ludäscher. Towards Constraint Provenance Games. TaPP'14.
[70]
M. C. Rinard and P. C. Diniz. Commutativity Analysis: a New Analysis Technique for Parallelizing Compilers. ACM Trans. Program. Lang. Syst., Nov 1997.
[71]
F. B. Schneider. Implementing Fault-tolerant Services Using the State Machine Approach: a Tutorial. ACM Comput. Surv., 22(4), Dec. 1990.
[72]
K. Sen and G. Agha. Automated Systematic Testing of Open Distributed Programs. In L. Baresi and R. Heckel, editors, Fundamental Approaches to Software Engineering, volume 3922 of LNCS. 2006.
[73]
M. A. Shah, J. M. Hellerstein, and E. Brewer. Highly Available, Fault-tolerant, Parallel Dataflows. SIGMOD'04.
[74]
M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Research report, INRIA, 2011.
[75]
D. Skeen. Nonblocking Commit Protocols. SIGMOD '81.
[76]
M. Stonebraker. Concurrency control and consistency of multiple copies of data in distributed ingres. IEEE Trans. Softw. Eng., May 1979.
[77]
T. Stoppard. Arcadia: a play in two acts. Samuel French, Inc., 1993.
[78]
D. B. Terry, A. J. Demers, K. Petersen, M. J. Spreitzer, M. M. Theimer, and B. B. Welch. Session Guarantees for Weakly Consistent Replicated Data. PDIS '94.
[79]
R. H. Thomas. A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases. ACM Trans. Database Syst., June 1979.
[80]
A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. SIGMOD '12.
[81]
J. D. Ullman. Principles of Database and Knowledge-Base Systems: Volume II: The New Technologies. W. H. Freeman & Co., 1990.
[82]
W. Vogels. Eventually Consistent. Commun. ACM, January 2009.
[83]
Y. Wu, A. Haeberlen, W. Zhou, and B. T. Loo. Answering Why-not Queries in Software-defined Networks with Negative Provenance. HotNets'13.
[84]
M. Yabandeh, N. Knezevic, D. Kostic, and V. Kuncak. CrystalBall: Predicting and Preventing Inconsistencies in Deployed Distributed Systems. NSDI'09.
[85]
J. Yang, T. Chen, M. Wu, Z. Xu, X. Liu, H. Lin, M. Yang, F. Long, L. Zhang, and L. Zhou. MODIST: Transparent Model Checking of Unmodified Distributed Systems. NSDI'09.
[86]
Y. Yu, P. Manolios, and L. Lamport. Model checking tla+ specifications. CHARME '99.
[87]
M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized Streams: an Efficient and Fault-tolerant Model for Stream Processing on Large Clusters. HotCloud'12.
[88]
C. Zamfir and G. Candea. Execution Synthesis: A Technique for Automated Software Debugging. EuroSys '10.
[89]
W. Zhou, M. Sherr, T. Tao, X. Li, B. T. Loo, and Y. Mao. Efficient querying an maintenance of network provenance at internet-scale. SIGMOD '10.

Cited By

View all
  • (2025)Unveiling the microservices testing methods, challenges, solutions, and solutions gaps: A systematic mapping studyJournal of Systems and Software10.1016/j.jss.2024.112232220(112232)Online publication date: Feb-2025
  • (2024)MONARCHProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3692025(529-543)Online publication date: 10-Jul-2024
  • (2024)Efficient exposure of partial failure bugs in distributed systems with inferred abstract statesProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691895(1267-1283)Online publication date: 16-Apr-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
May 2015
2110 pages
ISBN:9781450327589
DOI:10.1145/2723372
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 May 2015

Check for updates

Author Tags

  1. fault-tolerance
  2. provenance
  3. verification

Qualifiers

  • Research-article

Funding Sources

  • NSF

Conference

SIGMOD/PODS'15
Sponsor:
SIGMOD/PODS'15: International Conference on Management of Data
May 31 - June 4, 2015
Victoria, Melbourne, Australia

Acceptance Rates

SIGMOD '15 Paper Acceptance Rate 106 of 415 submissions, 26%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)482
  • Downloads (Last 6 weeks)50
Reflects downloads up to 08 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Unveiling the microservices testing methods, challenges, solutions, and solutions gaps: A systematic mapping studyJournal of Systems and Software10.1016/j.jss.2024.112232220(112232)Online publication date: Feb-2025
  • (2024)MONARCHProceedings of the 2024 USENIX Conference on Usenix Annual Technical Conference10.5555/3691992.3692025(529-543)Online publication date: 10-Jul-2024
  • (2024)Efficient exposure of partial failure bugs in distributed systems with inferred abstract statesProceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation10.5555/3691825.3691895(1267-1283)Online publication date: 16-Apr-2024
  • (2024)Demystifying the Fight Against Complexity: A Comprehensive Study of Live Debugging Activities in Production Cloud SystemsProceedings of the 2024 ACM Symposium on Cloud Computing10.1145/3698038.3698568(341-360)Online publication date: 20-Nov-2024
  • (2024)Efficient Reproduction of Fault-Induced Failures in Distributed Systems with Feedback-Driven Fault InjectionProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695979(46-62)Online publication date: 4-Nov-2024
  • (2024)If At First You Don’t Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software SystemsProceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles10.1145/3694715.3695971(63-78)Online publication date: 4-Nov-2024
  • (2024)MicroRes: Versatile Resilience Profiling in Microservices via Degradation Dissemination IndexingProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis10.1145/3650212.3652131(325-337)Online publication date: 11-Sep-2024
  • (2024)PROV-IO: A Cross-Platform Provenance Framework for Scientific Data on HPC SystemsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2024.337455535:5(844-861)Online publication date: May-2024
  • (2024)MicroFI: Non-Intrusive and Prioritized Request-Level Fault Injection for Microservice ApplicationsIEEE Transactions on Dependable and Secure Computing10.1109/TDSC.2024.3363902(1-18)Online publication date: 2024
  • (2023)Smart Public Transportation Sensing: Enhancing Perception and Data Management for Efficient and Safety OperationsSensors10.3390/s2322922823:22(9228)Online publication date: 16-Nov-2023
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media