ABSTRACT
Parallel application benchmarks are indispensable for evaluating and optimizing HPC software and hardware. However, it is challenging and costly to obtain high-fidelity benchmarks that reflect the scale and complexity of state-of-the-art parallel applications. Hand-extracted synthetic benchmarks are time- and labor-intensive to create. Real applications themselves, while offering the most accurate performance evaluation, are expensive to compile, port, and reconfigure, and are often simply inaccessible due to security or ownership concerns. This work contributes APPrime, a novel tool for trace-based automatic parallel benchmark generation. Taking as input standard communication and I/O traces of an application's execution, it couples accurate automatic phase identification with statistical regeneration of event parameters to create compact, portable, and to some degree reconfigurable parallel application benchmarks. Experiments with four NAS Parallel Benchmarks (NPB) and three real scientific simulation codes confirm the fidelity of APPrime benchmarks: they retain the original applications' performance characteristics, in particular their relative performance across platforms. Moreover, the resulting benchmarks, already released online, are much more compact and easier to port than the original applications.
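The regeneration idea described above can be illustrated with a minimal sketch: record per-phase empirical distributions of event parameters (here, message sizes per MPI operation) from a trace, then sample synthetic events from those distributions. This is a hypothetical toy illustration of the general technique, not APPrime's actual implementation; the trace format, `build_model`, and `regenerate` are all invented for this example.

```python
import random
from collections import Counter

def build_model(trace):
    """Group trace events by (phase, operation) and record the empirical
    distribution of their parameters (message sizes, in this toy example)."""
    model = {}
    for phase, op, size in trace:
        model.setdefault((phase, op), Counter())[size] += 1
    return model

def regenerate(model, n_events, seed=0):
    """Sample synthetic events whose parameters follow the same empirical
    distributions as the recorded trace, cycling through the phases."""
    rng = random.Random(seed)
    keys = sorted(model)
    events = []
    for i in range(n_events):
        phase, op = keys[i % len(keys)]  # round-robin over (phase, op) pairs
        sizes, weights = zip(*sorted(model[(phase, op)].items()))
        events.append((phase, op, rng.choices(sizes, weights=weights)[0]))
    return events

# A tiny hand-made trace: (phase id, MPI operation, message size in bytes).
trace = [
    (0, "MPI_Send", 1024), (0, "MPI_Send", 1024), (0, "MPI_Send", 4096),
    (1, "MPI_Allreduce", 8), (1, "MPI_Allreduce", 8),
]
model = build_model(trace)
synthetic = regenerate(model, 6)
```

Every synthetic event draws its size from the sizes actually observed for its phase and operation, so the regenerated stream statistically resembles the original while being fully decoupled from the application's source code.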
Combining Phase Identification and Statistic Modeling for Automated Parallel Benchmark Generation. In PPoPP 2015: Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.