ABSTRACT
We present Scribe, the first system to provide transparent, low-overhead application record-replay and the ability to go live from replayed execution. Scribe introduces new lightweight operating system mechanisms, rendezvous and sync points, to efficiently record nondeterministic interactions such as related system calls, signals, and shared memory accesses. Rendezvous points make a partial ordering of execution based on system call dependencies sufficient for replay, avoiding the recording overhead of maintaining an exact execution ordering. Sync points convert asynchronous interactions that can occur at arbitrary times into synchronous events that are much easier to record and replay.
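The rendezvous-point idea can be illustrated with a small user-space sketch. It is not Scribe's kernel implementation. It assumes each shared resource carries its own sequence counter, so accesses to the same resource are replayed in recorded order while accesses to different resources stay unordered. The names `ResourceClock` and `replay_log`, and the log layout, are hypothetical.

```python
import threading
from collections import defaultdict


class ResourceClock:
    """Per-resource sequence counters acting as hypothetical rendezvous points."""

    def __init__(self):
        self._cond = threading.Condition()
        self._next = defaultdict(int)  # resource id -> next expected sequence number

    def record(self, resource):
        """Recording side: stamp an access with the resource's next sequence number."""
        with self._cond:
            seq = self._next[resource]
            self._next[resource] += 1
            return seq

    def replay(self, resource, seq, action):
        """Replay side: block until it is this access's turn on the resource."""
        with self._cond:
            self._cond.wait_for(lambda: self._next[resource] == seq)
            # Simplification: the action runs under one global lock.
            result = action()
            self._next[resource] += 1
            self._cond.notify_all()
            return result


def replay_log(log):
    """Replay per-thread logs of (resource, seq, value) events concurrently.

    Returns the order in which values were applied to each resource.
    """
    clock = ResourceClock()
    applied = defaultdict(list)

    def worker(events):
        for resource, seq, value in events:
            clock.replay(resource, seq,
                         lambda r=resource, v=value: applied[r].append(v))

    threads = [threading.Thread(target=worker, args=(events,))
               for events in log.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return dict(applied)
```

Only the per-resource order is enforced here; the log never states how events on `"pipe"` interleave with events on `"file"`, which is the sense in which a partial order suffices.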
We have implemented Scribe without changing, relinking, or recompiling applications, libraries, or operating system kernels, and without any specialized hardware support such as hardware performance counters. It works on commodity Linux operating systems, and commodity multi-core and multiprocessor hardware. Our results show for the first time that an operating system mechanism can correctly and transparently record and replay multi-process and multi-threaded applications on commodity multiprocessors. Scribe recording overhead is less than 2.5% for server applications including Apache and MySQL, and less than 15% for desktop applications including Firefox, Acrobat, OpenOffice, parallel kernel compilation, and movie playback.
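The sync points mentioned in the abstract can be pictured with a similar sketch. It assumes that an asynchronous event such as a signal is queued when it arrives and delivered only at the next well-defined point in the execution, so the log needs just the index of that point. The classes `Recorder` and `Replayer` are hypothetical and are not taken from Scribe.

```python
from collections import defaultdict, deque


class Recorder:
    """Defers asynchronous events to sync points and logs where they landed."""

    def __init__(self):
        self.pending = deque()
        self.sync_count = 0
        self.log = []  # entries are (sync point index, signal name)

    def post(self, signal):
        """An asynchronous event arrives; queue it instead of delivering now."""
        self.pending.append(signal)

    def sync_point(self):
        """Deliver all queued events and log them against this sync point."""
        delivered = []
        while self.pending:
            signal = self.pending.popleft()
            self.log.append((self.sync_count, signal))
            delivered.append(signal)
        self.sync_count += 1
        return delivered


class Replayer:
    """Re-delivers logged events at the same sync point indices."""

    def __init__(self, log):
        self.by_index = defaultdict(list)
        for index, signal in log:
            self.by_index[index].append(signal)
        self.sync_count = 0

    def sync_point(self):
        """Return the events recorded for the current sync point, if any."""
        delivered = list(self.by_index.get(self.sync_count, []))
        self.sync_count += 1
        return delivered
```

During replay nothing depends on wall-clock timing: an event is delivered when the replayed execution reaches the same sync point index, regardless of when it originally arrived.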
Index Terms
- Transparent, lightweight application execution replay on commodity multiprocessor operating systems