DOI: 10.1145/2656045.2656050
Research article

Can we put concurrency back into redundant multithreading?

Published: 12 October 2014

ABSTRACT

Software-implemented fault tolerance (SIFT) mechanisms make it possible to tolerate transient hardware faults in commercial off-the-shelf (COTS) systems without specialized resilient hardware. Unfortunately, existing SIFT methods at both the compiler and the operating system levels are often restricted to single-threaded applications and hence do not apply to multithreaded software on modern multicore platforms.

We present RomainMT, an operating system service that provides replication for unmodified multithreaded applications. Replicating these programs is challenging, because scheduling-induced non-determinism may cause replicated threads to execute different valid code paths. This complicates the distinction between valid behavior and the effects of hardware errors.
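The divergence problem described above can be illustrated with a toy model. The sketch below is not RomainMT code; the function `run_replica`, the thread names, and the two critical sections are hypothetical and chosen only to show that two fault-free replicas can produce different results when their threads acquire a lock in different orders.

```python
# Illustrative toy model only; all names here are assumptions, not RomainMT APIs.

def run_replica(lock_order):
    """Apply two critical sections to a shared value in the given lock order."""
    critical_sections = {
        "A": lambda x: x + 1,  # thread A increments the shared value
        "B": lambda x: x * 2,  # thread B doubles the shared value
    }
    shared = 1
    for thread in lock_order:
        shared = critical_sections[thread](shared)
    return shared

replica_1 = run_replica(["A", "B"])  # scheduler lets thread A take the lock first
replica_2 = run_replica(["B", "A"])  # scheduler lets thread B take the lock first

print(replica_1, replica_2)  # prints "4 3"
```

Both results are valid outcomes of the same program, yet a comparison of the two replicas would report a mismatch, which is indistinguishable from the effect of a hardware error unless the lock order is made identical across replicas.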

RomainMT solves these problems by transparently making multithreaded execution deterministic. We present two alternative mechanisms that differ in the assumptions made about the respective applications and investigate their performance implications. Our evaluation using the SPLASH2 benchmark suite shows that the overhead for triple-modular redundancy (TMR) is 24% for applications with two application threads and 65% for four application threads.
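Once all replicas follow the same schedule, a disagreement between them points to a fault rather than to scheduling. The abstract does not describe how RomainMT compares replicas, so the following is only a minimal sketch of generic TMR majority voting under that assumption; `run_replica`, `majority_vote`, and the single-bit-flip fault model are hypothetical.

```python
from collections import Counter

def run_replica(lock_order, faulty=False):
    """Toy replica: thread A increments, thread B doubles a shared value."""
    shared = 1
    for thread in lock_order:
        shared = shared + 1 if thread == "A" else shared * 2
    if faulty:
        shared ^= 1 << 3  # assumed fault model: one flipped bit in the result
    return shared

def majority_vote(outputs):
    """Return (majority value, indices of disagreeing replicas).

    The value is None when no strict majority exists.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        return None, list(range(len(outputs)))
    return value, [i for i, out in enumerate(outputs) if out != value]

order = ["A", "B"]  # the same lock order is enforced in every replica
outputs = [
    run_replica(order),
    run_replica(order, faulty=True),  # replica 1 suffers the injected fault
    run_replica(order),
]

print(majority_vote(outputs))  # prints "(4, [1])"
```

With three replicas, a single faulty replica is outvoted and can be identified by its index; with only two replicas a mismatch could be detected but not attributed.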

Published in

EMSOFT '14: Proceedings of the 14th International Conference on Embedded Software
October 2014
301 pages
ISBN: 9781450330527
DOI: 10.1145/2656045

Copyright © 2014 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 60 of 203 submissions (30%)
