research-article

Rebound: scalable checkpointing for coherent shared memory

Published: 04 June 2011

Abstract

As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors.

To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
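
The idea of building checkpoints around dynamic groups of communicating processors can be illustrated with a small software model. The sketch below is not Rebound's hardware mechanism or its distributed algorithms. It assumes, purely for illustration, that each communication between two processors is recorded as an undirected edge, and that a checkpoint started by one processor must include every processor reachable through such edges since the last checkpoint. The names `DependenceTracker`, `record_communication`, `group_of`, and `checkpoint` are hypothetical.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Toy model (an assumption, not the paper's design) of interaction tracking."""

    def __init__(self):
        # For each processor, the peers it has communicated with
        # since its last checkpoint.
        self.peers = defaultdict(set)

    def record_communication(self, producer, consumer):
        # In a real system this information would be derived from coherence
        # transactions; here it is simply recorded by hand.
        self.peers[producer].add(consumer)
        self.peers[consumer].add(producer)

    def group_of(self, initiator):
        # Breadth-first search over recorded interactions: the result is the
        # set of processors transitively connected to the initiator.
        seen = {initiator}
        queue = deque([initiator])
        while queue:
            proc = queue.popleft()
            for peer in self.peers.get(proc, ()):
                if peer not in seen:
                    seen.add(peer)
                    queue.append(peer)
        return seen

    def checkpoint(self, initiator):
        # Checkpoint the whole group, then forget its recorded interactions.
        # Every edge touching a group member lies inside the group, so
        # removing the members' entries clears all of those edges.
        group = self.group_of(initiator)
        for proc in group:
            self.peers.pop(proc, None)
        return group


tracker = DependenceTracker()
tracker.record_communication(0, 1)
tracker.record_communication(1, 2)
tracker.record_communication(5, 6)
print(sorted(tracker.checkpoint(0)))  # [0, 1, 2]
print(sorted(tracker.group_of(5)))    # [5, 6]
```

In this model, processors 5 and 6 are untouched by the checkpoint that processor 0 starts, which is the property that separates local from global checkpointing: only the processors that actually communicated take part.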



      • Published in

        ACM SIGARCH Computer Architecture News, Volume 39, Issue 3
        ISCA '11
        June 2011
        462 pages
        ISSN: 0163-5964
        DOI: 10.1145/2024723
        • ACM Conferences
          ISCA '11: Proceedings of the 38th Annual International Symposium on Computer Architecture
          June 2011
          488 pages
          ISBN: 9781450304726
          DOI: 10.1145/2000064

        Copyright © 2011 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

