Abstract
Distributed systems are difficult to implement correctly because they must handle both concurrency and failures: machines may crash at arbitrary points and networks may reorder, drop, or duplicate packets. Further, their behavior is often too complex to permit exhaustive testing. Bugs in these systems have led to the loss of critical data and unacceptable service outages. We present Verdi, a framework for implementing and formally verifying distributed systems in Coq. Verdi formalizes various network semantics with different faults, and the developer chooses the most appropriate fault model when verifying their implementation. Furthermore, Verdi eases the verification burden by enabling the developer to first verify their system under an idealized fault model, then transfer the resulting correctness guarantees to a more realistic fault model without any additional proof burden. To demonstrate Verdi's utility, we present the first mechanically checked proof of linearizability of the Raft state machine replication algorithm, as well as verified implementations of a primary-backup replication system and a key-value store. These verified systems provide similar performance to unverified equivalents.
- Amazon. Summary of the Amazon EC2 and Amazon RDS service disruption in the US East Region, Apr. 2011. http://aws.amazon.com/ message/65648/.Google Scholar
- S. Bishop, M. Fairbairn, M. Norrish, P. Sewell, M. Smith, and K. Wansbrough. Engineering with logic: HOL specification and symbolic-evaluation testing for TCP implementations. In Proceedings of the 33rd ACM Symposium on Principles of Programming Languages (POPL), pages 55–66, Charleston, SC, Jan. 2006. Google ScholarDigital Library
- T. D. Chandra, R. Griesemer, and J. Redstone. Paxos made live: An engineering perspective. In Proceedings of the 26th ACM SIGACTSIGOPS Symposium on Principles of Distributed Computing (PODC), pages 398–407, Portland, OR, Aug. 2007. Google ScholarDigital Library
- A. Chlipala. Mostly-automated verification of low-level programs in computational separation logic. In Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 234–245, San Jose, CA, June 2011. Google ScholarDigital Library
- R. L. Constable, S. F. Allen, H. M. Bromley, W. R. Cleaveland, J. F. Cremer, R. W. Harper, D. J. Howe, T. B. Knoblock, N. P. Mendler, P. Panangaden, J. T. Sasaki, and S. F. Smith. Implementing Mathematics with The Nuprl Proof Development System. Prentice Hall, 1986. Google ScholarDigital Library
- CoreOS. etcd: A highly-available key value store for shared configuration and service discovery, 2014.Google Scholar
- https://github.com/coreos/etcd.Google Scholar
- S. J. Garland and N. Lynch. Using I/O automata for developing distributed systems. In Foundations of Component-based Systems. Cambridge University Press, 2000. Google ScholarDigital Library
- D. Geels, G. Altekar, P. Maniatis, T. Roscoe, and I. Stoica. Friday: Global comprehension for distributed replay. In Proceedings of the 4th Symposium on Networked Systems Design and Implementation (NSDI), pages 285–298, Cambridge, MA, Apr. 2007. Google ScholarDigital Library
- A. Guha, M. Reitblatt, and N. Foster. Machine-verified network controllers. In Proceedings of the 2013 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 483–494, Seattle, WA, June 2013. Google ScholarDigital Library
- Z. Guo, S. McDirmid, M. Yang, L. Zhuang, P. Zhang, Y. Luo, T. Bergan, P. Bodik, M. Musuvathi, Z. Zhang, and L. Zhou. Failure recovery: When the cure is worse than the disease. In Proceedings of the HotOS XIV, Santa Ana Pueblo, NM, May 2013. Google ScholarDigital Library
- M. Hayden. The Ensemble System. PhD thesis, Cornell University, 1998. Google ScholarDigital Library
- M. P. Herlihy and J. M. Wing. Linearizability: A correctness condition for concurrent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, 1990. Google ScholarDigital Library
- J. J. Hickey, N. Lynch, and R. van Renesse. Specifications and proofs for Ensemble layers. In Proceedings of the 15th International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 119–133, York, UK, Mar. 2009. Google ScholarDigital Library
- High Scalability. The updated big list of articles on the Amazon outage, May 2011. http://highscalability.com/blog/2011/5/2/theupdated-big-list-of-articles-on-the-amazon-outage.html.Google Scholar
- D. Jackson. Software Abstractions: Logic, Language, and Analysis. MIT Press, Feb. 2012. Google ScholarDigital Library
- C. Killian, J. W. Anderson, R. Braud, R. Jhala, and A. Vahdat. Mace: Language support for building distributed systems. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 179–188, San Diego, CA, June 2007. Google ScholarDigital Library
- C. Killian, J. W. Anderson, R. Jhala, and A. Vahdat. Life, death, and the critical transition: Finding liveness bugs in systems code. In Proceedings of the 4th Symposium on Networked Systems Design and Implementation (NSDI), pages 243–256, Cambridge, MA, Apr. 2007. Google ScholarDigital Library
- G. Klein, K. Elphinstone, G. Heiser, J. Andronick, D. Cock, P. Derrin, D. Elkaduwe, K. Engelhardt, M. Norrish, R. Kolanski, T. Sewell, H. Tuch, and S. Winwood. seL4: Formal verification of an OS kernel. In Proceedings of the 22nd ACM Symposium on Operating Systems Principles (SOSP), pages 207–220, Big Sky, MT, Oct. 2009. Google ScholarDigital Library
- L. Lamport. Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers. Addison-Wesley Professional, July 2002. Google ScholarDigital Library
- L. Lamport. Thinking for programmers, Apr. 2014. http://channel9.msdn.com/Events/Build/2014/3-642.Google Scholar
- V. Le, M. Afshari, and Z. Su. Compiler validation via equivalence modulo inputs. In Proceedings of the 2014 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 216–226, Edinburgh, UK, June 2014. Google ScholarDigital Library
- X. Leroy. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, July 2009. Google ScholarDigital Library
- X. Liu, C. Kreitz, R. van Renesse, J. Hickey, M. Hayden, K. Birman, and R. L. Constable. Building reliable, high-performance communication systems from components. In Proceedings of the 17th ACM Symposium on Operating Systems Principles (SOSP), pages 80–92, Kiawah Island, SC, Dec. 1999. Google ScholarDigital Library
- X. Liu, Z. Guo, X. Wang, F. Chen, X. Lian, J. Tang, M. Wu, M. F. Kaashoek, and Z. Zhang. D 3 S: Debugging deployed distributed systems. In Proceedings of the 5th Symposium on Networked Systems Design and Implementation (NSDI), pages 423–437, San Francisco, CA, Apr. 2008. Google ScholarDigital Library
- N. A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996. ISBN 1558603484. Google ScholarDigital Library
- A. Nanevski, G. Morrisett, A. Shinnar, P. Govereau, and L. Birkedal. Ynot: Dependent types for imperative programs. In Proceedings of the 13th ACM SIGPLAN International Conference on Functional Programming (ICFP), Victoria, British Columbia, Canada, Sept. 2008. Google ScholarDigital Library
- C. Newcombe, T. Rath, F. Zhang, B. Munteanu, M. Brooker, and M. Deardeuff. Use of formal methods at Amazon Web Services, Sept. 2014. http://research.microsoft.com/enus/um/people/lamport/tla/formal-methods-amazon.pdf.Google Scholar
- NYTimes. Amazon’s trouble raises cloud computing doubts, Apr. 2011. http://www.nytimes.com/2011/04/23/technology/23cloud.html.Google Scholar
- D. Ongaro. Consensus: Bridging Theory and Practice. PhD thesis, Stanford University, Aug. 2014.Google ScholarDigital Library
- D. Ongaro and J. Ousterhout. In search of an understandable consensus algorithm. In Proceedings of the 2014 USENIX Annual Technical Conference, pages 305–319, Philadelphia, PA, June 2014. Google ScholarDigital Library
- J. L. Peterson. Petri nets. ACM Computing Surveys, pages 223–252, Sept. 1977. Google ScholarDigital Library
- V. Rahli. Interfacing with proof assistants for domain specific programming using EventML. In Proceedings of the 10th International Workshop On User Interfaces for Theorem Provers, Bremen, Germany, July 2012.Google Scholar
- D. Ricketts, V. Robert, D. Jang, Z. Tatlock, and S. Lerner. Automating formal proofs for reactive systems. In Proceedings of the 2014 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 452–462, Edinburgh, UK, June 2014. Google ScholarDigital Library
- T. Ridge. Verifying distributed systems: The operational approach. In Proceedings of the 36th ACM Symposium on Principles of Programming Languages (POPL), pages 429–440, Savannah, GA, Jan. 2009. Google ScholarDigital Library
- D. Sangiorgi and D. Walker. PI-Calculus: A Theory of Mobile Processes. Cambridge University Press, New York, NY, USA, 2001. ISBN 0521781779. Google ScholarDigital Library
- N. Schiper, V. Rahli, R. van Renesse, M. Bickford, and R. L. Constable. Developing correctly replicated databases using formal tools. In Proceedings of the 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 395–406, Atlanta, GA, June 2014. Google ScholarDigital Library
- Verdi. Verdi, 2014. https://github.com/uwplse/verdi.Google Scholar
- M. Yabandeh, N. Kneževi´c, D. Kosti´c, and V. Kuncak. CrystalBall: Predicting and preventing inconsistencies in deployed distributed systems. In Proceedings of the 5th Symposium on Networked Systems Design and Implementation (NSDI), pages 229–244, San Francisco, CA, Apr. 2008. Google ScholarDigital Library
- J. Yang and C. Hawblitzel. Safe to the last instruction: Automated verification of a type-safe operating system. In Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 99–110, Toronto, Canada, June 2010. Google ScholarDigital Library
- J. Yang, T. Chen, M. Wu, Z. Xu, X. Liu, H. Lin, M. Yang, F. Long, L. Zhang, and L. Zhou. M O D IST : Transparent model checking of unmodified distributed systems. In Proceedings of the 6th Symposium on Networked Systems Design and Implementation (NSDI), pages 213–228, Boston, MA, Apr. 2009. Google ScholarDigital Library
- X. Yang, Y. Chen, E. Eide, and J. Regehr. Finding and understanding bugs in C compilers. In Proceedings of the 2011 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 283–294, San Jose, CA, June 2011. Google ScholarDigital Library
- D. Yuan, Y. Luo, X. Zhuang, G. R. Rodrigues, X. Zhao, Y. Zhang, P. U. Jain, and M. Stumm. Simple testing can prevent most critical failures: An analysis of production failures in distributed data-intensive systems. In Proceedings of the 11th Symposium on Operating Systems Design and Implementation (OSDI), pages 249–265, Broomfield, CO, Oct. 2014. Google ScholarDigital Library
- P. Zave. Using lightweight modeling to understand Chord. ACM SIGCOMM Computer Communication Review, 42(2):49–57, Apr. 2012. Google ScholarDigital Library
Index Terms
- Verdi: a framework for implementing and formally verifying distributed systems
Recommendations
Verdi: a framework for implementing and formally verifying distributed systems
PLDI '15: Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and ImplementationDistributed systems are difficult to implement correctly because they must handle both concurrency and failures: machines may crash at arbitrary points and networks may reorder, drop, or duplicate packets. Further, their behavior is often too complex ...
Planning for change in a formal verification of the raft consensus protocol
CPP 2016: Proceedings of the 5th ACM SIGPLAN Conference on Certified Programs and ProofsWe present the first formal verification of state machine safety for the Raft consensus protocol, a critical component of many distributed systems. We connected our proof to previous work to establish an end-to-end guarantee that our implementation ...
A unified Coq framework for verifying C programs with floating-point computations
CPP 2016: Proceedings of the 5th ACM SIGPLAN Conference on Certified Programs and ProofsWe provide concrete evidence that floating-point computations in C programs can be verified in a homogeneous verification setting based on Coq only, by evaluating the practicality of the combination of the formal semantics of CompCert Clight and the ...
Comments