Abstract
As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors.
To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
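The idea of building checkpoints and rollbacks around dynamic groups of communicating processors can be illustrated with a small software model. The sketch below is not Rebound's hardware mechanism, which the abstract does not detail. It assumes that a dependence is recorded whenever one processor reads data last written by another, that a checkpoint must include the transitive producers of the initiator, and that a rollback must include its transitive consumers. The class `DependenceTracker` and all of its methods are hypothetical names chosen for illustration, and the sketch does not model clearing dependences after a checkpoint.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Toy model of inter-thread dependence tracking (illustrative only)."""

    def __init__(self):
        self.last_writer = {}              # address -> processor that last wrote it
        self.producers = defaultdict(set)  # consumer -> processors it read data from
        self.consumers = defaultdict(set)  # producer -> processors that read its data

    def write(self, proc, addr):
        # Assumption: only the most recent writer of an address matters.
        self.last_writer[addr] = proc

    def read(self, proc, addr):
        # A read of another processor's data creates a producer -> consumer edge.
        writer = self.last_writer.get(addr)
        if writer is not None and writer != proc:
            self.producers[proc].add(writer)
            self.consumers[writer].add(proc)

    @staticmethod
    def _closure(start, edges):
        # Breadth-first transitive closure over the given edge map.
        seen = {start}
        queue = deque([start])
        while queue:
            p = queue.popleft()
            for q in edges.get(p, ()):
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
        return seen

    def checkpoint_set(self, initiator):
        # Assumed rule: the initiator checkpoints together with its producers.
        return self._closure(initiator, self.producers)

    def rollback_set(self, initiator):
        # Assumed rule: the initiator rolls back together with its consumers.
        return self._closure(initiator, self.consumers)


tracker = DependenceTracker()
tracker.write(0, "x")
tracker.read(1, "x")   # processor 1 consumes data from processor 0
tracker.write(1, "y")
tracker.read(2, "y")   # processor 2 consumes data from processor 1
tracker.write(3, "z")  # processor 3 communicates with nobody

print(sorted(tracker.checkpoint_set(2)))  # [0, 1, 2]
print(sorted(tracker.rollback_set(0)))    # [0, 1, 2]
print(sorted(tracker.checkpoint_set(3)))  # [3]
```

In this model, processor 3 checkpoints alone because it has not communicated, whereas a global scheme would involve all four processors in every checkpoint and every rollback.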
Rebound: scalable checkpointing for coherent shared memory
ISCA '11: Proceedings of the 38th annual international symposium on Computer architecture