ABSTRACT
Software-implemented fault tolerance (SIFT) mechanisms make it possible to tolerate transient hardware faults in commercial off-the-shelf (COTS) systems without specialized resilient hardware. Unfortunately, existing SIFT methods at both the compiler and the operating system levels are often restricted to single-threaded applications and hence do not apply to multithreaded software on modern multicore platforms.
We present RomainMT, an operating system service that provides replication for unmodified multithreaded applications. Replicating these programs is challenging, because scheduling-induced non-determinism may cause replicated threads to execute different valid code paths. This complicates the distinction between valid behavior and the effects of hardware errors.
RomainMT solves these problems by transparently making multithreaded execution deterministic. We present two alternative mechanisms that differ in the assumptions they make about the replicated applications, and we investigate their performance implications. Our evaluation using the SPLASH2 benchmark suite shows that the overhead of triple-modular redundancy (TMR) is 24% for applications with two application threads and 65% for applications with four application threads.
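To illustrate why determinism matters for TMR, the following minimal sketch shows majority voting over three replica states. It is not RomainMT's implementation: the function name `majority_vote` and the representation of a replica's externally visible state as a hashable tuple are assumptions made only for this example. The point is that voting can flag a deviating replica as faulty only if all correct replicas are guaranteed to reach the same state.

```python
from collections import Counter


def majority_vote(replica_states):
    """Compare replica states and return (majority_state, faulty_indices).

    replica_states: list of hashable values, one per replica (hypothetical
    stand-in for whatever state a replication service would compare).
    Raises RuntimeError if no state is held by a strict majority.
    """
    counts = Counter(replica_states)
    state, votes = counts.most_common(1)[0]
    # A strict majority is required; otherwise the fault cannot be masked.
    if votes * 2 <= len(replica_states):
        raise RuntimeError("no majority among replicas")
    faulty = [i for i, s in enumerate(replica_states) if s != state]
    return state, faulty


if __name__ == "__main__":
    # Three replicas; the third one deviates (e.g. after a bit flip).
    state, faulty = majority_vote([(1, 2), (1, 2), (1, 3)])
    print(state, faulty)
```

If scheduling differences let two correct replicas take different valid code paths, their states would disagree and this check would report a fault that did not occur, which is the problem deterministic execution is meant to remove.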
Index Terms
- Can we put concurrency back into redundant multithreading?