skip to main content
10.1145/2925426.2926264acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article
Public Access

SReplay: Deterministic Sub-Group Replay for One-Sided Communication

Published: 01 June 2016 Publication History

Abstract

Replay of parallel execution is required by HPC debuggers and resilience mechanisms. Up-to-date, there is no existing deterministic replay solution for one-sided communication. The essential problem is that the readers of updated data do not have any information on which remote threads produced the updates, the conventional happens-before based ordering tracking techniques are challenging to work at scale. This paper presents SReplay, the first software tool for sub-group deterministic record and replay for one-sided communication. SReplay allows the user to specify and record the execution of a set of threads of interest (sub-group), and then deterministically replays the execution of the sub-group on a local machine without starting the remaining threads. SReplay ensures sub-group determinism using a hybrid data- and order-replay technique. SReplay maintains scalability by a combination of local logging and approximative event order tracking within sub-group. Our evaluation on deterministic and nondeterministic UPC programs shows that SReplay introduces an overhead ranging from 1.3x to 29x, when running on 1,024 cores and tracking up to 16 threads.

References

[1]
Berkeley UPC. http://upc.lbl.gov.
[2]
Pin - A Dynamic Binary Instrumentation Tool. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool.
[3]
Program Record/Replay Toolkit. https://software.intel.com/en-us/articles/program-recordreplay-toolkit.
[4]
The Chapel Parallel Programming Language. http://chapel.cray.com/index.html.
[5]
The NAS Parallel Benchmarks. Available at http://www.nas.nasa.gov/Software/NPB.
[6]
UPC Home Page. http://upc-lang.org.
[7]
UPC Task Library. http://upc.lbl.gov/task.shtml.
[8]
Using PinPlay for Reproducible Analysis and Replay Debugging.
[9]
X10: Performance and Productivity at Scale. http://x10-lang.org.
[10]
MPI: A Message-Passing Interface Standard. Version 3.0. Message Passing Interface Forum, 2012.
[11]
S. V. Adve and K. Gharachorloo. Shared Memory Consistency Models: A Tutorial. Western Reseach Laboratory-Compaq. Research Report 95/7, September 1995.
[12]
D. Arnold, D. Ahn, B. de Supinski, G. Lee, B. Miller, and M. Schulz. Stack Trace Analysis for Large Scale Debugging. In Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International, 2007.
[13]
A. Basu, J. Bobba, and M. D. Hill. Karma: Scalable Deterministic Record-Replay. In International Conference on Supercomputing, 2011.
[14]
M. D. Bond, M. Kulkarni, M. Cao, M. F. Salmi, and J. Huang. Efficient Deterministic Replay of Multithreaded Programs Based on Efficient Tracking of Cross-Thread Dependences. In Ohio Sate CSE Tech Report OSU-CISRC-12/14-TR20. Ohio State University, 2014.
[15]
A. Bouteiller, G. Bosilca, and J. Dongarra. Retrospect: Deterministic Replay of MPI Applications for Interactive Distributed Debugging. In EuroPVM/MPI, pages 297--306. LNCS, 2007.
[16]
S. Burckhardt, R. Alur, and M. M. K. Martin. CheckFence: Checking consistency of concurrent data types on relaxed memory models. In Prog. Lang. Des. and Impl., Jun 2007.
[17]
Y. Chen, W. Hu, T. Chen, and R. Wu. LReplay: A Pending Period based Deterministic Replay Scheme. In International Symposium on Computer Architecture, 2010.
[18]
J.-D. Choi and H. Srinivasan. Deterministic Replay of Java Multithreaded Applications. In Proceedings of the SIGMETRICS Symposium on Parallel and Distributed Tools, SPDT '98, 1998.
[19]
J. Chung, I. Lee, M. Sullivan, J. H. Ryoo, D. W. Kim, D. H. Yoon, L. Kaplan, and M. Erez. Containment Domains: A Scalable, Efficient, and Flexible Resilience Scheme for Exascale Systems. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, 2012.
[20]
D. Cunningham, D. Grove, B. Herta, A. Iyengar, K. Kawachiya, H. Murata, V. A. Saraswat, M. Takeuchi, and O. Tardieu. Resilient X10: Efficient Failure-aware Programming. In Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, February 2014.
[21]
D. Eachempati, A. Richardson, S. Jana, T. Liao, H. Calandra, and B. M. Chapman. A Coarray Fortran Implementation to Support Data-intensive Application Development. Cluster Computing, 17(2):569--583, 2014.
[22]
E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A Survey of Rollback-recovery Protocols in Message-passing Systems. ACM Computing Surveys, 34(3):375--408, September 2002.
[23]
E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick. Parallel De Bruijn Graph Construction and Traversal for De Novo Genome Assembly. In Proceedings of the 26th ACM/IEEE International Conference on High Performance Computing, Networking, Storage and Analysis (SC), November 2014.
[24]
E. Georganas, A. Buluç, J. Chapman, L. Oliker, D. Rokhsar, and K. Yelick. Parallel de bruijn graph construction and traversal for de novo genome assembly. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '14, 2014.
[25]
P. Hao, P. Shamis, M. G. Venkata, S. Pophale, A. Welch, S. W. Poole, and B. M. Chapman. Fault Tolerance for OpenSHMEM. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2014.
[26]
T. Hoefler, J. Dinan, R. Thakur, B. B. P. Balaji, W. Gropp, and K. D. Underwood. Remote Memory Access Programming in MPI-3. IEEE Transactions on Parallel Computing, 2(2), Sept 12015.
[27]
N. Honarmand, N. Dautenhahn, J. Torrellas, S. King, G. Pokam, and C. Pereira. Cyrus: Unintrusive Application-Level Record-Replay for Replay Parallelism. In International conference on Architectural Support for Programming Language s and Operating Systems, 2013.
[28]
N. Honarmand and J. Torrellas. RelaxReplay: Record and Replay for Relaxed-Consistency Multiprocessors. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.
[29]
D. R. Hower and M. D. Hill. Rerun: Exploiting Episodes for Lightweight Memory Race Recording. In International Symposium on Computer Architecture, 2008.
[30]
J. Huang, C. Zhang, and J. Dolby. CLAP: Recording Local Executions to Reproduce Concurrency Failures. In ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013.
[31]
M. M. Islam and A. Muzahid. Characterizing Real World Bugs Causing Sequential Consistency Violations. In Proceedings of Hot Topics in Parallelism (HotPar), June 2013.
[32]
W. Kuchera and C. Wallace. The UPC Memory Model: Problems and Prospects. In Proceedings of the 18th International Parallel and Distributed Processing Symposium, Apr 2004.
[33]
O. Laadan, N. Viennot, and J. Nieh. Transparent, Lightweight Application Execution Replay on Commodity Multiprocessor Operating Systems. In ACM SIGMETRICS'10, 2010.
[34]
T. J. LeBlanc and J. M. Mellor-Crummey. Debugging Parallel Programs with Instant Replay. IEEE Transactions on Computers, 36(4):471--482, April 1987.
[35]
N. Machado, L. Rodrigues, and B. Lucia. Concurrency Debugging with Differential Schedule Projections. In Proceedings of 36th annual ACM SIGPLAN conference on Programming Language Design and Implementation, June 2015.
[36]
D. Manivannan and M. Singhal. Quasi-Synchronous Checkpointing: Models, Characterization, and Classification. IEEE Transactions on Parallel Distributed Systems, 10(7):703--713, July 1999.
[37]
J. Mellor-Crummey, L. Adhianto, G. Jin, and W. N. S. III. A New Vision for Coarray Fortran. In The Third Conference on Partitioned Global Address Space Programming Models (PGAS), October 2009.
[38]
S.-J. Min, C. Iancu, and K. Yelick. Hierarchical Work Stealing on Manycore Clusters. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2011.
[39]
P. Montesinos, L. Ceze, and J. Torrellas. DeLorean: Recording and Deterministically Replaying Shared-Memory Multiprocessor Execution Efficiently. In International Symposium on Computer Architecture, 2008.
[40]
L. D. Moura and N. Bjørner. Z3: An Efficient SMT Solver. In TACAS'08/ETAPS'08 Proceedings of the Theory and practice of software, 14th international conference on Tools and Algorithms for the Construction and Analysis of Systems, 2008.
[41]
A. Muzahid, S. Qi, and J. Torrellas. Vulcan: Hardware Support for Detecting Sequential Consistency Violations Dynamically. In International Symposium on Microarchitecture, December 2012.
[42]
N. Namashivayam, S. Ghosh, D. Khaldi, D. Eachempati, and B. M. Chapman. Native Mode-Based Optimizations of Remote Memory Accesses in OpenSHMEM for Intel Xeon Phi. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2014.
[43]
S. Narayanasamy, C. Pereira, and B. Calder. Recording Shared Memory Dependencies Using Strata. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[44]
S. Narayanasamy, C. Pereira, H. Patil, R. Cohn, and B. Calder. Automatic Logging of Operating System Effects to Guide Application-level Architecture Simulation. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '06/Performance '06, 2006.
[45]
S. Olivier, J. Huan, J. Liu, J. Prins, J. Dinan, P. Sadayappan, and C.-W. Tseng. UTS: An Unbalanced Tree Search Benchmark. In Proceedings of the 19th International Conference on Languages and Compilers for Parallel Computing, LCPC'06, 2007.
[46]
C.-S. Park, K. Sen, and C. Iancu. Scalable Data Race Detection for Partitioned Global Address Space Programs. SIGPLAN Not.
[47]
H. Patil, C. Pereira, M. Stallcup, G. Lueck, and J. Cownie. PinPlay: A Framework for Deterministic Replay and Reproducible Analysis of Parallel Programs. In Proceedings of the 8th Annual IEEE/ACM International Symposium on Code Generation and Optimization, CGO '10, 2010.
[48]
G. Pokam, C. Pereira, K. Danne, R. Kassa, and A.-R. Adl-Tabatabai. Architecting a Chunk-based Memory Race Recorder in Modern CMPs. In International Symposium on Microarchitecture, 2009.
[49]
G. Pokam, C. Pereira, S. Hu, A.-R. Adl-Tabatabai, J. Gottschlich, J. Ha, and Y. Wu. CoreRacer: A Practical Memory Race Recorder for Multicore X86 TSO Processors. In International Symposium on Microarchitecture, 2011.
[50]
X. Qian, H. Huang, B. Sahelices, and D. Qian. Rainbow: Efficient Memory Race Recording with High Replay Parallelism for Relaxed Memory Model. In International Symposium on High Performance Computer Architecture, 2013.
[51]
X. Qian, B. Sahelices, and D. Qian. Pacifier: Record and Replay for Relaxed-Consistency Multiprocessors with Distributed Directory Protocol. In International Symposium on Computer Architecture, 2014.
[52]
X. Qian, B. Sahelices, J. Torrellas, and D. Qian. Volition: Scalable and Precise Sequential Cons istency Violation Detection. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2013.
[53]
R. Schwarz and F. Mattern. Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail. Distributed Computing, 7(3):149--174, March 1994.
[54]
K. Sen. Scalable Automated Methods for Dynamic Program Analysis. In Ph.D Thesis. University of Illinois, Urbana-Champaign, 2006.
[55]
K. Sen, G. Rosu, and G. Agha. Runtime safety analysis of multithreaded programs. In ESEC/SIGSOFT FSE, pages 337--346, 2003.
[56]
J. Sloan, R. Kumar, and G. Bronevetsky. Large Scale Debugging of Parallel Tasks with AutomaDeD. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '11, 2011.
[57]
J. Sloan, R. Kumar, and G. Bronevetsky. An Algorithmic Approach to Error Localization and Partial Recomputation for Low-overhead Fault Tolerance. In The 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2013.
[58]
C. Svensson, D. Kesler, R. Kumar, and G. Pokam. MPreplay: Architecture Support for Deterministic Replay of Message Passing Programs on Message Passing Many-core Processors. In UIUC Technical Report UILU-09-2209, Apr 2009.
[59]
O. Tardieu, B. Herta, D. Cunningham, D. Grove, P. Kambadur, V. A. Saraswat, A. Shinnar, M. Takeuchi, and M. Vaziri. X10 and APGAS at Petascale. In Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming. ACM, February 2014.
[60]
G. Voskuilen, F. Ahmad, and T. N. Vijaykumar. Timetraveler: Exploiting Acyclic Races for Optimizing Memory Race Recording. In International Symposium on Computer Architecture, 2010.
[61]
P. Wang, H. Jiang, X. Liu, and J. Han. Towards Hybrid Programming in Big Data. In Proceedings of 7th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), June 2015.
[62]
A. Welch, S. Pophale, P. Shamis, O. R. Hernandez, S. W. Poole, and B. M. Chapman. Extending the OpenSHMEM Memory Model to Support User-Defined Spaces. In Proceedings of the Fifth Conference on Partitioned Global Address Space Programming Models (PGAS), Oct 2014.
[63]
M. Xu, R. Bodik, and M. D. Hill. A "Flight Data Recorder" for Enabling Full-system Multiprocessor Deterministic Replay. In International Symposium on Computer Architecture, 2003.
[64]
M. Xu, R. Bodik, and M. D. Hill. A Regulated Transitive Reduction (RTR) for Longer Memory Race Recording. In International Conference on Architectural Support for Programming Languages and Operating Systems, 2006.
[65]
R. Xue, X. Liu, M. Wu, Z. Guo, W. Chen, W. Zheng, and G. Voelker. MPIWiz: Subgroup Reproducible Replay of MPI Applications. In Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pages 251--260. ACM, February 2009.
[66]
K. Yelick, D. Bonachea, and C. Wallace. A Proposal for a UPC Memory Consistency Model, v1.0. In Lawrence Berkeley National Lab Tech Report LBNL-54983, May 2004.

Cited By

View all
  • (2023)A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance ComputingThe International Journal of High Performance Computing Applications10.1177/1094342023116661037:3-4(306-327)Online publication date: 5-Apr-2023
  • (2022)Debugging MPI Implementations via Reduction-to-Primitives2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)10.1109/SuperCheck56652.2022.00007(1-9)Online publication date: Nov-2022
  • (2021)Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph KernelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308153032:12(2936-2952)Online publication date: 1-Dec-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426
© 2016 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Debugging Tools
  2. One-Sided Communication
  3. PGAS

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Funding Sources

Conference

ICS '16
Sponsor:

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)128
  • Downloads (Last 6 weeks)15
Reflects downloads up to 09 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)A Survey of Graph Comparison Methods with Applications to Nondeterminism in High-Performance ComputingThe International Journal of High Performance Computing Applications10.1177/1094342023116661037:3-4(306-327)Online publication date: 5-Apr-2023
  • (2022)Debugging MPI Implementations via Reduction-to-Primitives2022 IEEE/ACM Third International Symposium on Checkpointing for Supercomputing (SuperCheck)10.1109/SuperCheck56652.2022.00007(1-9)Online publication date: Nov-2022
  • (2021)Identifying Degree and Sources of Non-Determinism in MPI Applications Via Graph KernelsIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2021.308153032:12(2936-2952)Online publication date: 1-Dec-2021
  • (2018)Record-and-Replay Techniques for HPC Systems: A SurveySupercomputing Frontiers and Innovations: an International Journal10.14529/jsfi1801025:1(11-30)Online publication date: 15-Mar-2018

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media