Abstract
Deterministic replay is a type of emerging technique dedicated to providing deterministic executions of computer programs in the presence of nondeterministic factors. The application scopes of deterministic replay are very broad, making it an important research topic in domains such as computer architecture, operating systems, parallel computing, distributed computing, programming languages, verification, and hardware testing.
In this survey, we comprehensively review existing studies on deterministic replay by introducing a taxonomy. Basically, existing deterministic replay schemes can be classified into two categories, single-processor (SP) schemes and multiprocessor (MP) schemes. By reviewing the details of these two categories of schemes respectively, we summarize and compare how existing schemes address technical issues such as log size, record slowdown, replay slowdown, implementation cost, and probe effect, which may shed some light on future studies on deterministic replay.
- S. Adve and H. Boehm. 2010. Memory models: A case for rethinking parallel languages and hardware. Commun. ACM 53, 8, 90--101. Google ScholarDigital Library
- S. V. Adve and M. D. Hill. 1990. Weak ordering—a new definition. In Proceedings of the 17th Annual International Symposium on Computer Architecture (ISCA’90). 2--14. Google ScholarDigital Library
- G. Altekar and I. Stoica. 2009. ODR: Output-deterministic replay for multicore debugging. In Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP’09). 193--206. Google ScholarDigital Library
- A. Aviram, S. Weng, S. Hu, and Bryan Ford. 2010. Efficient system-enforced deterministic parallelism. In Proceedings of the 9th USENIX Symposium on Operating System Design and Implementation (OSDI’10). 1--16. Google ScholarDigital Library
- J. F. Bartlett. 1981. A non stop kernel. In Proceedings of the 8th ACM Symposium on Operating Systems Principles (SOSP’81). 22--29. Google ScholarDigital Library
- A. Basu, J. Bobba, and M. D. Hill. 2011. Karma: Scalable deterministic record-replay. In Proceedings of the 25th ACM International Conference on Supercomputing (ICS’11). 359--368. Google ScholarDigital Library
- T. Bergan, O. Anderson, J. Devietti, L. Ceze, and D. Grossman. 2010. CoreDet: A compiler and runtime system for deterministic multithreaded execution. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’10). Google ScholarDigital Library
- C. Bienia, S. Kumar, J. P. Singh, and K. Li. 2008. The PARSEC benchmark suite: Characterization and architectural implications. In Proceedings of the 2008 Parallel Architectures and Compilation Techniques (PACT-08). 72--81. Google ScholarDigital Library
- A. Bouteiller, G. Bosilca, and J. Dongarra. 2007. Retrospect: Deterministic replay of MPI applications for interactive distributed debugging. In Proceedings of the 14th European PVM/MPI User’s Group Conference (PVM/MPI’07). Google ScholarDigital Library
- T. Bressoud and F. Schneider. 1996. Hypervisor-based fault tolerance. ACM Trans. Comput. Syst. 14, 1, 1--11. Google ScholarDigital Library
- L. Ceze, J. Tuck, P. Montesinos, and J. Torrellas. 2007. BulkSC: Bulk enforcement of sequential consistency. In Proceedings of the 34th ACM/IEEE International Symposium on Computer Architecture (ISCA’07). 278--289. Google ScholarDigital Library
- H. Chen, X. Wu, L. Yuan, B. Zang, P. Yew, and F. Chong. 2008. From speculation to security: Practical and efficient information flow tracking using speculative hardware. In Proceedings of the 35th ACM/IEEE International Symposium on Computer Architecture (ISCA’08). 401--412. Google ScholarDigital Library
- Y. Chen, T. Chen, and W. Hu. 2009. Global Clock, Physical Time Order and Pending Period Analysis in Multiprocessor Systems. (http://arxiv.org/pdf/0903.4961 2009).Google Scholar
- Y. Chen, T. Chen, L. Li, L. Li, L. Yang, M. Su, and W. Hu. 2013. LDet: Determinizing asynchronous transfer for post-silicon debugging. IEEE Trans. Comput. 62, 9, 1732--1744. Google ScholarDigital Library
- Y. Chen, W. Hu, T. Chen, and R. Wu. 2010. LReplay: A pending period based deterministic replay scheme. In Proceedings of the 37th ACM/IEEE International Symposium on Computer Architecture (ISCA’10). 187--197. Google ScholarDigital Library
- J. Chow, T. Garfinkel, and P. Chen. 2008. Decoupling dynamic program analysis from execution in virtual environments. In Proceedings of the USENIX Annual Technical Conference (USENIX’08). Google ScholarDigital Library
- J. Chow, D. Lucchetti, T. Garfinkel, G. Lefebvre, R. Gardner, J. Mason, S. Small, and P. Chen. 2010. Multi-stage replay with crosscut. In Proceedings of the 6th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE’10). Google ScholarDigital Library
- C. Clémenoņ, J. Fritscher, M. Meehan, and R. Rühl. 1995. An implementation of race detection and deterministic replay with MPI. In Proceedings of the 1st International European Conference on Parallel and Distributed Computing (Euro-Par’95). Google ScholarDigital Library
- R. Curtis and L. D. Wittie. 1982. BUGNET: A debugging system for parallel programming environments. In Proceedings of the 3rd International Conference on Distributed Computing Systems (ICDCS’82).Google Scholar
- J. de Kergommeaux, M. Ronsse, and K. de Bosschere. 1999. MPL*: Efficient record/play of nondeterministic features of message passing libraries. In Proceedings of the 14th European PVM/MPI User’s Group Conference (PVM/MPI’99). Google ScholarDigital Library
- J. Devietti, B. Lucia, L. Ceze, and M. Oskin. 2009. DMP: Deterministic shared memory multiprocessing. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’09). Google ScholarDigital Library
- J. Devietti, J. Nelson, T. Bergan, L. Ceze, and D. Grossman. 2011. RCDC: A relaxed consistency deterministic computer. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’11). Google ScholarDigital Library
- G. Dunlap, S. King, S. Cinar, M. Basrai, and P. Chen. 2002. ReVirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th USENIX Symposium on Operating System Design and Implementation (OSDI’02). Google ScholarDigital Library
- G. Dunlap, D. Lucchetti, M. Fetterman, and P. Chen. 2008. Execution replay of multiprocessor virtual machines. In Proceedings of the 4th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE’08). Google ScholarDigital Library
- E. Elnozahy, L. Alvisi, Y. Wang, and D. Johnson. 2002. A survey of rollback-recovery protocols in message-passing systems. Comput. Surv. 34, 3, 375--408. Google ScholarDigital Library
- T. Foster, D. Lastor, and P. Singh. 2007. First silicon functional validation and debug of multicore microprocessors. IEEE Trans. Very Large Scale Integr. Syst. 15, 5, 495--504. Google ScholarDigital Library
- D. Geels, G. Altekar, S. Shenker, and I. Stoica. 2006. Replay debugging for distributed applications. In Proceedings of the USENIX Annual Technical Conference (USENIX’06). Google ScholarDigital Library
- GNU. 2009. Gdb: The gnu project debugger. (http://www.gnu.org/software/gdb 2009).Google Scholar
- J. R. Goodman. 1991. Cache Consistency and Sequential Consistency. University of Wisconsin-Madison, Computer Sciences Department.Google Scholar
- M. Goodstein, E. Vlachos, S. Chen, P. Gibbons, M. Kozuch, and T. Mowry. 2010. Butterfly analysis: Adapting dataflow analysis to dynamic parallel monitoring. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’10). Google ScholarDigital Library
- Z. Guo, X.Wang, J. Tang, X. Liu, Z. Xu, M. Wu, M. F. Kaashoek, and Z. Zhang. 2008. R2: An application-level kernel for record and replay. In Proceedings of the 8th USENIX Symposium on Operating System Design and Implementation (OSDI’08). Google ScholarDigital Library
- S. Hangal, D. Vahia, C. Manovit, and J.-Y. J. Lu. 2004. TSOtool: A program for verifying memory systems using the memory consistency model. In Proceedings of the 31st Annual International Symposium on Computer Architecture (ISCA’04). 114. Google ScholarDigital Library
- M. Heath, W. Burleson, and I. Harris. 2005. Synchro-tokens: A deterministic GALS methodology for chip-level debug and test. IEEE Trans. Comput. 54, 12, 1532--1546. Google ScholarDigital Library
- D. Hower, P. Dudnik, M. Hill, and D. Wood. 2011. Calvin: Deterministic or not? Free will to choose. In Proceedings of the 17th International Symposium on High-Performance Computer Architecture (HPCA’11). Google ScholarDigital Library
- D. Hower and M. Hill. 2008. Rerun: Exploiting episodes for lightweight memory race recording. In Proceedings of the 35th ACM/IEEE International Symposium on Computer Architecture (ISCA’08). Google ScholarDigital Library
- W. Hu, J. Wang, X. Gao, Y. Chen, Q. Liu, and G. Li. 2009. Godson-3: A scalable multicore RISC processor with x86 emulation. IEEE Micro 29, 2, 17--29. Google ScholarDigital Library
- S. King, G. Dunlap, and P. Chen. 2005. Debugging operating systems with time-traveling virtual machines. In Proceedings of the USENIX Annual Technical Conference (USENIX’05). Google ScholarDigital Library
- R. Konuru, H. Srinivasan, and J. Choi. 2000. Deterministic replay of distributed java applications. In Proceedings of the 14th IEEE International Parallel and Distributed Processing Symposium (IPDPS’00). Google ScholarDigital Library
- D. Kranzlmüller, C. Schaubschläger, and J. Volkert. 2001. An integrated record&replay mechanism for nondeterministic message passing programs. In Proceedings of the 8th European PVM/MPI User’’s Group Conference (PVM/MPI’01). Google ScholarDigital Library
- O. Laadan, N. Viennot, and J. Nieh. 2010. Transparent, lightweight application execution replay on commodity multiprocessor operating systems. In Proceedings of the ACM SIGMETRICS International Conference on Measurements and Modeling of Computer Systems (SIGMETRICS’10). Google ScholarDigital Library
- L. Lamport. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7, 558--565. Google ScholarDigital Library
- D. Lee, P. Chen, J. Flinn, and S. Narayanasamy. 2012. Chimera: Hybrid program analysis for determinism. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’12). Google ScholarDigital Library
- D. Lee, M. Said, S. Narayanasamy, and Z. Yang. 2011. Offline symbolic analysis to infer total store order. In Proceedings of the 17th International Symposium on High-Performance Computer Architecture (HPCA’11). Google ScholarDigital Library
- D. Lee, M. Said, S. Narayanasamy, Z. Yang, and C. Pereira. 2009. Offline symbolic analysis for multi-processor execution replay. In Proceedings of the 42nd ACM/IEEE International Symposium on Microarchitecture (MICRO’09). Google ScholarDigital Library
- D. Lee, B. Wester, K. Veeraraghavan, S. Narayanasamy, P. Chen, and J. Flinn. 2010. Respec: Efficient online multiprocessor replay via speculation and external determinism. In Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’10). Google ScholarDigital Library
- X. Liu, W. Lin, A. Pan, and Z. Zhang. 2007. WiDS checker: Combating bugs in distributed systems. In Proceedings of the 4th USENIX Symposium on Networked Systems Design and Implementation (NSDI’07). Google ScholarDigital Library
- D. Lucchetti, S. Reinhardt, and P. Chen. 2005. ExtraVirt: Detecting and recovering from transient processor faults. In The 20nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP) Work-in-Progress Session. Google ScholarDigital Library
- C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, P. Lowney, S. Wallace, V. Reddi, and K. Hazelwood. 2007. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’07). Google ScholarDigital Library
- M. Maruyama, T. Tsumura, and H. Nakashima. 2005. Parallel program debugging based on data-replay. In Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Systems (PDCS’05).Google Scholar
- P. Montesinos, L. Ceze, and J. Torrellas. 2008. DeLorean: Recording and deterministically replaying shared-memory multiprocessor execution effciently. In Proceedings of the 35th ACM/IEEE International Symposium on Computer Architecture (ISCA’08). Google ScholarDigital Library
- M. Musuvathi, S. Qadeer, T. Ball, G. Basler, P. Nainar, and I. Neamtiu. 2008. Finding and reproducing heisenbugs in concurrent programs. In Proceedings of the 8th USENIX Symposium on Operating System Design and Implementation (OSDI’08). Google ScholarDigital Library
- S. Narayanasamy, C. Pereira, and B. Calder. 2006. Recording shared memory dependencies using strata. In Proceedings of the 12nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’06). Google ScholarDigital Library
- S. Narayanasamy, G. Pokam, and B. Calder. 2005. BugNet: Continuously recording program execution for deterministic replay debugging. In Proceedings of the 32st ACM/IEEE International Symposium on Computer Architecture (ISCA’05). Google ScholarDigital Library
- R. Netzer. 1993. Optimal tracing and replay for debugging shared-memory parallel programs. In Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging. Google ScholarDigital Library
- R. Netzer and B. Miller. 1990. On the complexity of event ordering for shared-memory parallel program executions. In Proceedings of the International Conference on Parallel Processing (ICPP’90).Google Scholar
- R. Netzer and B. Miller. 1992. Optimal tracing and replay for debugging message-passing parallel programs. In Proceedings of the 6th ACM/IEEE Conference on Supercomputing (SC’92). Google ScholarDigital Library
- J. Newsome and D. Song. 2005. Dynamic taint analysis for automatic detection, analysis, and signaturegeneration of exploits on commodity software. In Proceedings of the Network and Distributed System Security Symposium (NDSS’05).Google Scholar
- E. Nightingale, P. Chen, and J. Flinn. 2005. Speculative execution in a distributed file system. In Proceedings of the 20nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP’05). Google ScholarDigital Library
- M. Olszewski, J. Ansel, and S. Amarasinghe. 2009. Kendo: Efficient deterministic multithreading in software. In Proceedings of the 14th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’09). Google ScholarDigital Library
- S. Park, Y. Zhou, W. Xiong, Z. Yin, R. Kaushik, K. Lee, and S. Lu. 2009. PRES: Probabilistic replay with execution sketching on multiprocessors. In Proceedings of the 22nd ACM SIGOPS Symposium on Operating Systems Principles (SOSP’09). Google ScholarDigital Library
- H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie. 2010. Pinplay: A framework for deterministic replay and reproducible analysis of parallel programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO’10). Google ScholarDigital Library
- G. Pokam, C. Pereira, K. Danne, R. Kassa, and A. Adl-Tabatabai. 2009. Architecting a chunk-based memory race recorder in modern CMPs. In Proceedings of the 42nd ACM/IEEE International Symposium on Microarchitecture (MICRO’09). Google ScholarDigital Library
- M. Ronsse and K. de Bosschere. 1999. RecPlay: A fully integrated practical record/replay system. ACM Trans. Comput. Syst. 17, 2, 133--152. Google ScholarDigital Library
- M. Russinovich and B. Cogswell. 1996. Replay for concurrent non-deterministic shared memory applications. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’96). Google ScholarDigital Library
- Y. Saito. 2005. Jockey: A user-space library for record-replay debugging. In Proceedings of the 6th International Workshop on Automated Analysis-driven Debugging (AADEBUG’05). Google ScholarDigital Library
- S. Sarangi, B. Greskamp, and J. Torrellas. 2006. CADRE: Cycle-accurate deterministic replay for hardware debugging. In Proceedings of the 37th IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’06). Google ScholarDigital Library
- J. Slye and E. Elnozahy. 1996. Supporting nondeterministic execution in fault-tolerant systems. In Proceedings of the 26th IEEE International Symposium on Fault-Tolerant Computing (FTCS’96). Google ScholarDigital Library
- J. Slye and E. Elnozahy. 1998. Support for software interrupts in log-based rollback-recovery. IEEE Trans. Comput. 47, 10, 1113--1123. Google ScholarDigital Library
- D. Sorin, M. Martin, M. Hill, and D. Wood. 2002. Safetynet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery. In Proceedings of the 29th ACM/IEEE International Symposium on Computer Architecture (ISCA’02). Google ScholarDigital Library
- M. Su, Y. Chen, and X. Gao. 2010. A general method to make multi-clock system deterministic. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE’10). Google ScholarDigital Library
- K. Veeraraghavan, D. Lee, B. Wester, J. Ouyang, P. Chen, J. Flinn, and S. Narayanasamy. 2011. DoublePlay: Parallelizing sequential logging and replay. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’11). Google ScholarDigital Library
- G. Voskuilen, F. Ahmad, and T. Vijaykumar. 2010. Timetraveler: Exploiting acyclic races for optimizing memory race recording. In Proceedings of the 37th ACM/IEEE International Symposium on Computer Architecture (ISCA’10). Google ScholarDigital Library
- J. Voung, R. Jhala, and S. Lerner. 2007. RELAY: Static race detection on millions of lines of code. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering (ESEC-FSE’07). Google ScholarDigital Library
- S. Woo, M. Ohara, E. Torrie, J. Singh, and A. Gupta. 1995. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd ACM/IEEE International Symposium on Computer Architecture (ISCA’95). Google ScholarDigital Library
- M. Xu, R. Bodik, and M. Hill. 2003. A “Flight Data Recorder” for enabling full-system multiprocessor deterministic replay. In Proceedings of the 30th ACM/IEEE International Symposium on Computer Architecture (ISCA’03). Google ScholarDigital Library
- M. Xu, M. Hill, and R. Bodik. 2006. A regulated transitive reduction (RTR) for longer memory race recording. In Proceedings of the 12nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS’06). Google ScholarDigital Library
- M. Xu, V. Malyugin, J. Sheldon, G. Venkitachalam, and B. Weissman. 2007. ReTrace: Collecting execution trace with virtual machine deterministic replay. In Proceedings of the 3rd Annual Workshop on Modeling, Benchmarking and Simulation (WoBS’07).Google Scholar
- R. Xue, X. Liu, M. Wu, Z. Guo, W. Chen, W. Zheng, Z. Zhang, and G. Voelker. 2009. MPIWiz: Subgroup reproducible replay of MPI applications. In Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’09). Google ScholarDigital Library
- J. Zhai, W. Chen, and W. Zheng. 2010. Phantom: Predicting performance of parallel applications on large-scale parallel machines using a single node. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP’10). Google ScholarDigital Library
Index Terms
- Deterministic Replay: A Survey
Recommendations
Transparent mutable replay for multicore debugging and patch validation
ASPLOS '13We present Dora, a mutable record-replay system which allows a recorded execution of an application to be replayed with a modified version of the application. This feature, not available in previous record-replay systems, enables powerful new ...
Transparent mutable replay for multicore debugging and patch validation
ASPLOS '13: Proceedings of the eighteenth international conference on Architectural support for programming languages and operating systemsWe present Dora, a mutable record-replay system which allows a recorded execution of an application to be replayed with a modified version of the application. This feature, not available in previous record-replay systems, enables powerful new ...
Transparent mutable replay for multicore debugging and patch validation
ASPLOS '13We present Dora, a mutable record-replay system which allows a recorded execution of an application to be replayed with a modified version of the application. This feature, not available in previous record-replay systems, enables powerful new ...
Comments