research-article

Public Access

SReplay: Deterministic Sub-Group Replay for One-Sided Communication

Authors:

Costin IancuAuthors Info & Claims

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 17, Pages 1 - 13

https://doi.org/10.1145/2925426.2926264

Published: 01 June 2016 Publication History

Abstract

Replay of parallel execution is required by HPC debuggers and resilience mechanisms. Up-to-date, there is no existing deterministic replay solution for one-sided communication. The essential problem is that the readers of updated data do not have any information on which remote threads produced the updates, the conventional happens-before based ordering tracking techniques are challenging to work at scale. This paper presents SReplay, the first software tool for sub-group deterministic record and replay for one-sided communication. SReplay allows the user to specify and record the execution of a set of threads of interest (sub-group), and then deterministically replays the execution of the sub-group on a local machine without starting the remaining threads. SReplay ensures sub-group determinism using a hybrid data- and order-replay technique. SReplay maintains scalability by a combination of local logging and approximative event order tracking within sub-group. Our evaluation on deterministic and nondeterministic UPC programs shows that SReplay introduces an overhead ranging from 1.3x to 29x, when running on 1,024 cores and tracking up to 16 threads.

References

[1]

Berkeley UPC. http://upc.lbl.gov.

[2]

Pin - A Dynamic Binary Instrumentation Tool. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.

[3]

Program Record/Replay Toolkit. https://software.intel.com/en-us/articles/program-recordreplay-toolkit.

[4]

The Chapel Parallel Programming Language. http://chapel.cray.com/index.html.

[5]

The NAS Parallel Benchmarks. Available at http://www.nas.nasa.gov/Software/NPB.

[6]

UPC Home Page. http://upc-lang.org.

[7]

UPC Task Library. http://upc.lbl.gov/task.shtml.

[8]

Using PinPlay for Reproducible Analysis and Replay Debugging.

[9]

X10: Performance and Productivity at Scale. http://x10-lang.org.

[10]

MPI: A Message-Passing Interface Standard. Version 3.0. Message Passing Interface Forum, 2012.

[11]

S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. Western Reseach Laboratory-Compaq. Research Report 95/7, September 1995.

[12]

D. Arnold, D. Ahn, B. de Supinski, G. Lee, B. Miller, and M. Schulz. Stack Trace Analysis for Large Scale Debugging. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, 2007.

[13]

A. Basu, J. Bobba, and M. D. Hill. Karma: Scalable Deterministic Record-Replay. In International Conference on Supercomputing, 2011.

Digital Library

[14]

M. D. Bond, M. Kulkarni, M. Cao, M. F. Salmi, and J. Huang. Efficient Deterministic Replay of Multithreaded Programs Based on Efficient Tracking of Cross-Thread Dependences. In Ohio Sate CSE Tech Report OSU-CISRC-12/14-TR20. Ohio State University, 2014.

[15]

A. Bouteiller, G. Bosilca, and J. Dongarra. Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging. In EuroPVM/MPI, pages 297--306. LNCS, 2007.

Digital Library

[16]

S. Burckhardt, R. Alur, and M. M. K. Martin. CheckFence: Checking consistency of concurrent data types on relaxed memory models. In Prog. Lang. Des. and Impl., Jun 2007.

Digital Library

[17]

Y. Chen, W. Hu, T. Chen, and R. Wu. LReplay: A Pending Period based Deterministic Replay Scheme. In International Symposium on Computer Architecture, 2010.

Digital Library

[18]

J.-D. Choi and H. Srinivasan. Deterministic Replay of Java Multithreaded Applications. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, SPDT '98, 1998.

Digital Library

[19]

J. Chung, I. Lee, M. Sullivan, J. H. Ryoo, D. W. Kim, D. H. Yoon, L. Kaplan, and M. Erez. Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, 2012.

Digital Library

[20]

D. Cunningham, D. Grove, B. Herta, A. Iyengar, K. Kawachiya, H. Murata, V. A. Saraswat, M. Takeuchi, and O. Tardieu. Resilient X10: Efficient Failure-aware Programming. In Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, February 2014.

Digital Library

[21]

D. Eachempati, A. Richardson, S. Jana, T. Liao, H. Calandra, and B. M. Chapman. A Coarray Fortran Implementation to Support Data-intensive Application Development. Cluster Computing, 17(2):569--583, 2014.

Digital Library

[22]

E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A Survey of Rollback-recovery Protocols in Message-passing Systems. ACM Computing Surveys, 34(3):375--408, September 2002.

Digital Library

[23]

E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick. Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly. In Proceedings of the 26th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2014.

Digital Library

[24]

E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick. Parallel de bruijn graph construction and traversal for de novo genome assembly. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, 2014.

Digital Library

[25]

P. Hao, P. Shamis, M. G. Venkata, S. Pophale, A. Welch, S. W. Poole, and B. M. Chapman. Fault Tolerance for OpenSHMEM. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2014.

Digital Library

[26]

T. Hoefler, J. Dinan, R. Thakur, B. B. P. Balaji, W. Gropp, and K. D. Underwood. Remote Memory Access Programming in MPI-3. IEEE Transactions on Parallel Computing, 2(2), Sept 12015.

Digital Library

[27]

N. Honarmand, N. Dautenhahn, J. Torrellas, S. King, G. Pokam, and C. Pereira. Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism. In International conference on Architectural Support for Programming Language s and Operating Systems, 2013.

Digital Library

[28]

N. Honarmand and J. Torrellas. RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.

Digital Library

[29]

D. R. Hower and M. D. Hill. Rerun: Exploiting Episodes for Lightweight Memory Race Recording. In International Symposium on Computer Architecture, 2008.

Digital Library

[30]

J. Huang, C. Zhang, and J. Dolby. CLAP: Recording Local Executions to Reproduce Concurrency Failures. In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013.

Digital Library

[31]

M. M. Islam and A. Muzahid. Characterizing Real World Bugs Causing Sequential Consistency Violations. In Proceedings of Hot Topics in Parallelism (HotPar), June 2013.

[32]

W. Kuchera and C. Wallace. The UPC Memory Model: Problems and Prospects. In Proceedings of the 18th International Parallel and Distributed Processing Symposium, Apr 2004.

[33]

O. Laadan, N. Viennot, and J. Nieh. Transparent, Lightweight Application Execution Replay on Commodity Multiprocessor Operating Systems. In ACM SIGMETRICS'10, 2010.

Digital Library

[34]

T. J. LeBlanc and J. M. Mellor-Crummey. Debugging Parallel Programs with Instant Replay. IEEE Transactions on Computers, 36(4):471--482, April 1987.

Digital Library

[35]

N. Machado, L. Rodrigues, and B. Lucia. Concurrency Debugging with Differential Schedule Projections. In Proceedings of 36th annual ACM SIGPLAN conference on Programming Language Design and Implementation, June 2015.

Digital Library

[36]

D. Manivannan and M. Singhal. Quasi-Synchronous Checkpointing: Models, Characterization, and Classification. IEEE Transactions on Parallel Distributed Systems, 10(7):703--713, July 1999.

Digital Library

[37]

J. Mellor-Crummey, L. Adhianto, G. Jin, and W. N. S. III. A New Vision for Coarray Fortran. In The Third Conference on Partitioned Global Address Space Programming Models (PGAS), October 2009.

Digital Library

[38]

S.-J. Min, C. Iancu, and K. Yelick. Hierarchical Work Stealing on Manycore Clusters. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2011.

[39]

P. Montesinos, L. Ceze, and J. Torrellas. DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently. In International Symposium on Computer Architecture, 2008.

Digital Library

[40]

L. D. Moura and N. Bjørner. Z3: An Efficient SMT Solver. In TACAS'08/ETAPS'08 Proceedings of the Theory and practice of software, 14th international conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008.

Digital Library

[41]

A. Muzahid, S. Qi, and J. Torrellas. Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically. In International Symposium on Microarchitecture, December 2012.

Digital Library

[42]

N. Namashivayam, S. Ghosh, D. Khaldi, D. Eachempati, and B. M. Chapman. Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2014.

Digital Library

[43]

S. Narayanasamy, C. Pereira, and B. Calder. Recording Shared Memory Dependencies Using Strata. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.

Digital Library

[44]

S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Automatic Logging of Operating System Effects to Guide Application-level Architecture Simulation. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '06/Performance '06, 2006.

Digital Library

[45]

S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: An Unbalanced Tree Search Benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, 2007.

Digital Library

[46]

C.-S. Park, K. Sen, and C. Iancu. Scalable Data Race Detection for Partitioned Global Address Space Programs. SIGPLAN Not.

Digital Library

[47]

H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie. PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '10, 2010.

Digital Library

[48]

G. Pokam, C. Pereira, K. Danne, R. Kassa, and A.-R. Adl-Tabatabai. Architecting a Chunk-based Memory Race Recorder in Modern CMPs. In International Symposium on Microarchitecture, 2009.

Digital Library

[49]

G. Pokam, C. Pereira, S. Hu, A.-R. Adl-Tabatabai, J. Gottschlich, J. Ha, and Y. Wu. CoreRacer: A Practical Memory Race Recorder for Multicore X86 TSO Processors. In International Symposium on Microarchitecture, 2011.

Digital Library

[50]

X. Qian, H. Huang, B. Sahelices, and D. Qian. Rainbow: Efficient Memory Race Recording with High Replay Parallelism for Relaxed Memory Model. In International Symposium on High Performance Computer Architecture, 2013.

Digital Library

[51]

X. Qian, B. Sahelices, and D. Qian. Pacifier: Record and Replay for Relaxed-Consistency Multiprocessors with Distributed Directory Protocol. In International Symposium on Computer Architecture, 2014.

Digital Library

[52]

X. Qian, B. Sahelices, J. Torrellas, and D. Qian. Volition: Scalable and Precise Sequential Cons istency Violation Detection. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2013.

Digital Library

[53]

R. Schwarz and F. Mattern. Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Distributed Computing, 7(3):149--174, March 1994.

Digital Library

[54]

K. Sen. Scalable Automated Methods for Dynamic Program Analysis. In Ph.D Thesis. University of Illinois, Urbana-Champaign, 2006.

Digital Library

[55]

K. Sen, G. Rosu, and G. Agha. Runtime safety analysis of multithreaded programs. In ESEC/SIGSOFT FSE, pages 337--346, 2003.

Digital Library

[56]

J. Sloan, R. Kumar, and G. Bronevetsky. Large Scale Debugging of Parallel Tasks with AutomaDeD. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.

Digital Library

[57]

J. Sloan, R. Kumar, and G. Bronevetsky. An Algorithmic Approach to Error Localization and Partial Recomputation for Low-overhead Fault Tolerance. In The 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2013.

Digital Library

[58]

C. Svensson, D. Kesler, R. Kumar, and G. Pokam. MPreplay: Architecture Support for Deterministic Replay of Message Passing Programs on Message Passing Many-core Processors. In UIUC Technical Report UILU-09-2209, Apr 2009.

[59]

O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. A. Saraswat, A. Shinnar, M. Takeuchi, and M. Vaziri. X10 and APGAS at Petascale. In Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, February 2014.

Digital Library

[60]

G. Voskuilen, F. Ahmad, and T. N. Vijaykumar. Timetraveler: Exploiting Acyclic Races for Optimizing Memory Race Recording. In International Symposium on Computer Architecture, 2010.

Digital Library

[61]

P. Wang, H. Jiang, X. Liu, and J. Han. Towards Hybrid Programming in Big Data. In Proceedings of 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), June 2015.

Digital Library

[62]

A. Welch, S. Pophale, P. Shamis, O. R. Hernandez, S. W. Poole, and B. M. Chapman. Extending the OpenSHMEM Memory Model to Support User-Defined Spaces. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2014.

Digital Library

[63]

M. Xu, R. Bodik, and M. D. Hill. A "Flight Data Recorder" for Enabling Full-system Multiprocessor Deterministic Replay. In International Symposium on Computer Architecture, 2003.

Digital Library

[64]

M. Xu, R. Bodik, and M. D. Hill. A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.

Digital Library

[65]

R. Xue, X. Liu, M. Wu, Z. Guo, W. Chen, W. Zheng, and G. Voelker. MPIWiz: Subgroup Reproducible Replay of MPI Applications. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 251--260. ACM, February 2009.

Digital Library

[66]

K. Yelick, D. Bonachea, and C. Wallace. A Proposal for a UPC Memory Consistency Model, v1.0. In Lawrence Berkeley National Lab Tech Report LBNL-54983, May 2004.

Cited By

Bhowmick SBell PTaufer M(2023)A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance ComputingThe International Journal of High Performance Computing Applications10.1177/1094342023116661037:3-4(306-327)Online publication date: 5-Apr-2023
https://doi.org/10.1177/10943420231166610
Cooperman GLi DZhao Z(2022)Debugging MPI Implementations via Reduction-to-Primitives2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)10.1109/SuperCheck56652.2022.00007(1-9)Online publication date: Nov-2022
https://doi.org/10.1109/SuperCheck56652.2022.00007
Chapp DTan NBhowmick STaufer M(2021)Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph KernelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308153032:12(2936-2952)Online publication date: 1-Dec-2021
https://doi.org/10.1109/TPDS.2021.3081530
Show More Cited By

SReplay: Deterministic Sub-Group Replay for One-Sided Communication
1. Software and its engineering
  1. Software creation and management
    1. Software verification and validation
      1. Software defect analysis

Recommendations

Productivity and performance using partitioned global address space languages
PASCO '07: Proceedings of the 2007 international workshop on Parallel symbolic computation

Partitioned Global Address Space (PGAS) languages combine the programming convenience of shared memory with the locality and performance control of message passing. One such language, Unified Parallel C (UPC) is an extension of ISO C defined by a ...
Asynchronous PGAS runtime for Myrinet networks
PGAS '10: Proceedings of the Fourth Conference on Partitioned Global Address Space Programming Model

PGAS languages aim to enhance productivity for large scale systems. The IBM Asynchronous PGAS runtime (APGAS) supports various high productivity programming languages including UPC, X10 and CAF. The runtime has been designed for scalability and ...
DART-MPI: An MPI-based Implementation of a PGAS Runtime System
PGAS '14: Proceedings of the 8th International Conference on Partitioned Global Address Space Programming Models

A Partitioned Global Address Space (PGAS) approach treats a distributed system as if the memory were shared on a global level. Given such a global view on memory, the user may program applications very much like shared memory systems. This greatly ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

U.S. Department of Energy

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

4
Total Citations
View Citations
514
Total Downloads

Downloads (Last 12 months)128
Downloads (Last 6 weeks)15

Reflects downloads up to 09 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Bhowmick SBell PTaufer M(2023)A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance ComputingThe International Journal of High Performance Computing Applications10.1177/1094342023116661037:3-4(306-327)Online publication date: 5-Apr-2023
https://doi.org/10.1177/10943420231166610
Cooperman GLi DZhao Z(2022)Debugging MPI Implementations via Reduction-to-Primitives2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)10.1109/SuperCheck56652.2022.00007(1-9)Online publication date: Nov-2022
https://doi.org/10.1109/SuperCheck56652.2022.00007
Chapp DTan NBhowmick STaufer M(2021)Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph KernelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308153032:12(2936-2952)Online publication date: 1-Dec-2021
https://doi.org/10.1109/TPDS.2021.3081530
Chapp DSato KAhn DTaufer M(2018)Record-and-Replay Techniques for HPC Systems: A SurveySupercomputing Frontiers and Innovations: an International Journal10.14529/jsfi1801025:1(11-30)Online publication date: 15-Mar-2018
https://dl.acm.org/doi/10.14529/jsfi180102

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Figures

Tables

Media

View Table of Conten