ABSTRACT
Accelerators are adopted to increase performance, reduce time-to-solution, and minimize energy-to-solution. However, employing them efficiently for given system and application characteristics is often a daunting task. A goal of this work is to propose a general model that predicts the performance and power requirements of an application whose computational portions are offloaded to an accelerator. Intel Xeon Phi is the only accelerator type investigated here, and only in offload execution mode. Because other accelerator types, such as GPUs, also employ this mode, the proposed model is directly applicable to them. The predictive capabilities of the model are demonstrated by determining the hardware-software configurations that minimize energy consumption for the CoMD proxy application executed on single or multiple nodes. For the CoMD problem sizes investigated here, the best modeled configuration was close to the best measured configuration, with a relative error under 5% of the energy consumed for most configurations. Initial model validation also confirmed the model's accuracy for a variety of model parameters, such as host computation time and power consumption on the host and accelerator. The model also provides estimates of the total data movement and computational throughput, as well as of key metrics, such as FLOPs-per-joule and bytes-per-joule, that are commonly used to study energy-performance trade-offs.
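To make the quantities named in the abstract concrete, the following is a minimal sketch, not the paper's model: it assumes energy can be approximated as average host plus accelerator power multiplied by time-to-solution, and it derives FLOPs-per-joule, bytes-per-joule, the minimum-energy configuration, and a relative error from that. The `Config` class, its field names, the helper functions, and all numeric values are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """One hypothetical hardware-software configuration (illustrative values only)."""
    name: str
    time_s: float         # time-to-solution, seconds
    host_power_w: float   # average host power, watts
    accel_power_w: float  # average accelerator power, watts
    flops: float          # total floating-point operations
    bytes_moved: float    # total data movement, bytes

    @property
    def energy_j(self) -> float:
        # Assumption: energy = (host power + accelerator power) * time.
        return (self.host_power_w + self.accel_power_w) * self.time_s

    @property
    def flops_per_joule(self) -> float:
        return self.flops / self.energy_j

    @property
    def bytes_per_joule(self) -> float:
        return self.bytes_moved / self.energy_j


def best_by_energy(configs):
    """Return the configuration with the lowest estimated energy."""
    return min(configs, key=lambda c: c.energy_j)


def relative_error(modeled: float, measured: float) -> float:
    """Relative error of a modeled value with respect to a measured one."""
    return abs(modeled - measured) / measured


# Made-up example configurations; none of these numbers come from the paper.
configs = [
    Config("A", time_s=10.0, host_power_w=100.0, accel_power_w=150.0,
           flops=2.4e12, bytes_moved=4.8e11),
    Config("B", time_s=8.0, host_power_w=120.0, accel_power_w=180.0,
           flops=2.4e12, bytes_moved=4.8e11),
    Config("C", time_s=12.0, host_power_w=90.0, accel_power_w=140.0,
           flops=2.4e12, bytes_moved=4.8e11),
]

best = best_by_energy(configs)
print(best.name, best.energy_j, best.flops_per_joule, best.bytes_per_joule)
```

In this sketch the fastest configuration also happens to use the least energy, but that need not hold in general: a higher power draw can outweigh a shorter run time, which is why energy is compared directly rather than inferred from time alone.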
Index Terms
- Modeling performance and energy for applications offloaded to Intel Xeon Phi