Cross-Accelerator Performance Profiling

ABSTRACT
The computing requirements of scientific applications have influenced processor design and have motivated the introduction and use of many-core processors, i.e., accelerators, for high performance computing (HPC). Consequently, it is now common for the compute nodes of HPC clusters to be composed of multiple computing devices, including accelerators. Although execution time can be used to compare the performance of different computing devices, there exists no standard way to analyze application performance across devices with very different architectural designs and, thus, to understand why one outperforms another. Without this knowledge, a developer is handicapped when attempting to tune application performance effectively, as is a hardware designer when trying to understand how best to improve the design of computing devices. In this paper, we use the LULESH 1.0 proxy application to compare and analyze the performance of three accelerators: the Intel® Xeon Phi™ and the NVIDIA Fermi and Kepler GPUs. Our study shows that LULESH 1.0 exhibits similar execution-time behavior across the three accelerators, but runs up to 7X faster on the Kepler. Despite the significant architectural differences between the Xeon Phi™ and the GPUs, and the differences in the metrics used to characterize their performance, we were able to quantify why the Kepler outperforms both the Fermi and the Xeon Phi™. To do this, we compared their achieved instructions per cycle (IPC) and vectorization usage, as well as their memory behavior and power and energy consumption.
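To make the comparison concrete, the sketch below shows one way to obtain achieved IPC on the host or the Xeon Phi™ using PAPI hardware counters. This is an illustrative example rather than the paper's actual instrumentation: it assumes a PAPI installation on which the preset events PAPI_TOT_INS and PAPI_TOT_CYC are supported, and kernel_of_interest() is a hypothetical stand-in for a LULESH loop nest.

```c
/* Illustrative sketch: achieved IPC via PAPI's low-level API.
 * Assumes the PAPI_TOT_INS and PAPI_TOT_CYC presets are supported;
 * kernel_of_interest() is a hypothetical stand-in for a LULESH kernel. */
#include <stdio.h>
#include <stdlib.h>
#include <papi.h>

static void kernel_of_interest(void)
{
    volatile double acc = 0.0;          /* placeholder compute loop */
    for (long i = 0; i < 100000000L; i++)
        acc += (double)i * 1.0e-9;
}

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];                /* [0] = instructions, [1] = cycles */

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI initialization failed\n");
        return EXIT_FAILURE;
    }
    if (PAPI_create_eventset(&eventset) != PAPI_OK ||
        PAPI_add_event(eventset, PAPI_TOT_INS) != PAPI_OK ||
        PAPI_add_event(eventset, PAPI_TOT_CYC) != PAPI_OK) {
        fprintf(stderr, "could not set up counters\n");
        return EXIT_FAILURE;
    }

    PAPI_start(eventset);               /* start counting ...            */
    kernel_of_interest();
    PAPI_stop(eventset, counts);        /* ... and read the event totals */

    /* Achieved IPC = retired instructions / elapsed cycles. */
    printf("instructions = %lld  cycles = %lld  IPC = %.3f\n",
           counts[0], counts[1], (double)counts[0] / (double)counts[1]);
    return EXIT_SUCCESS;
}
```

On the GPUs, where PAPI presets of this kind are not directly available, an analogous quantity can be collected with NVIDIA's profiling tools (for example, nvprof reports a per-kernel `ipc` metric), and board power can be sampled with `nvidia-smi --query-gpu=power.draw --format=csv -l 1` to estimate energy consumption by integrating over a run.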