DOI: 10.1145/2949550.2949567
Research article · Public Access

Cross-Accelerator Performance Profiling

Published: 17 July 2016

ABSTRACT

The computing requirements of scientific applications have influenced processor design, and have motivated the introduction and use of many-core processors, i.e., accelerators, for high performance computing (HPC). Consequently, it is now common for the compute nodes of HPC clusters to comprise multiple computing devices, including accelerators. Although execution time can be used to compare the performance of different computing devices, there exists no standard way to analyze application performance across devices with very different architectural designs and, thus, understand why one outperforms another. Without this knowledge, a developer is handicapped when attempting to effectively tune application performance, as is a hardware designer when trying to understand how best to improve the design of computing devices. In this paper, we use the LULESH 1.0 proxy application to compare and analyze the performance of three different accelerators: the Intel® Xeon Phi™ and the NVIDIA Fermi and Kepler GPUs. Our study shows that LULESH 1.0 exhibits similar execution-time behavior across the three accelerators, but runs up to 7X faster on the Kepler. Despite the significant architectural differences between the Xeon Phi™ and the GPUs, and the differences in the metrics used to characterize their performance, we were able to quantify why the Kepler outperforms both the Fermi and the Xeon Phi™. To do this, we compared their achieved instructions per cycle and vectorization usage, as well as their memory behavior and power and energy consumption.
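The cross-device comparison rests on a small set of derived metrics: achieved instructions per cycle (from hardware counter totals, e.g. as read with PAPI) and energy (integrated from periodic power samples, e.g. as polled with nvidia-smi). As a minimal illustration of those two reductions only (not the authors' actual tooling; the function names and sample values are hypothetical):

```python
# Illustrative sketch of the two summary metrics compared across accelerators.
# Counter totals and power samples here are made-up example values.

def achieved_ipc(instructions: int, cycles: int) -> float:
    """Achieved instructions per cycle from hardware counter totals
    (e.g. totals read via PAPI on the Xeon Phi)."""
    if cycles == 0:
        raise ValueError("cycle count must be non-zero")
    return instructions / cycles

def energy_joules(power_watts: list[float], interval_s: float) -> float:
    """Approximate energy from power samples taken at a fixed interval
    (e.g. GPU board power polled once per second), via a rectangle rule."""
    return sum(p * interval_s for p in power_watts)

if __name__ == "__main__":
    print(achieved_ipc(2_000_000_000, 4_000_000_000))  # 0.5
    print(energy_joules([100.0, 110.0, 105.0], 1.0))   # 315.0
```

Comparing these normalized metrics, rather than raw counters whose definitions differ between the Xeon Phi and the GPUs, is what makes the cross-architecture attribution possible.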


Published in

XSEDE16: Proceedings of the XSEDE16 Conference on Diversity, Big Data, and Science at Scale
July 2016, 405 pages
ISBN: 9781450347556
DOI: 10.1145/2949550

          Copyright © 2016 ACM


Publisher

Association for Computing Machinery, New York, NY, United States



          Qualifiers

          • research-article
          • Research
          • Refereed limited

Acceptance Rates

Overall Acceptance Rate: 129 of 190 submissions, 68%
