ABSTRACT
The multi-zone scalar pentadiagonal (SP-MZ) benchmark, part of the multi-zone NAS Parallel Benchmark suite, is ported to graphics processing units (GPUs) using OpenACC compiler directives. The sequence of optimizations necessary to transform the SP-MZ algorithm from CPU-oriented to GPU-oriented is presented. The performance of the OpenACC implementation on GPUs is measured using predefined mesh sizes. We observe a 30% speed-up using the OpenACC implementation on an NVIDIA Kepler K40 GPU compared to an eight-core Intel Xeon E5-2670 CPU with the small Class-A mesh (256 thousand points). Setting inter-zone boundary conditions directly on the device reduced run time by 22% by avoiding the high cost of host-device communication. Multi-device benchmarks with the larger Class-C mesh (4.3 million points) were scaled to 32 GPU nodes and matched or outperformed the CPU baseline with ten cores per node. Combining CPU and GPU computing power improved the throughput on the Class-C mesh by 75%. We define a larger zone size with one million points per node to better reflect modern usage of codes similar to SP-MZ. The OpenACC GPU implementation outperformed the baseline multi-core CPU by 29% on this real-world mesh size.
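To make the kind of kernel discussed above concrete, the following is a minimal sketch in C of a batched scalar pentadiagonal solve annotated with OpenACC directives. It is not the SP-MZ source (the NAS benchmark has its own implementation); it is a generic Gaussian elimination without pivoting, assumed here to operate on diagonally dominant systems. The names `penta_solve_batch`, `penta_demo_max_error`, and `IDX` are hypothetical, as is the storage layout, in which the line index varies fastest so that adjacent GPU threads access adjacent memory. Without an OpenACC compiler the pragmas are ignored and the code runs serially.

```c
#include <assert.h>

/* Hypothetical layout: row i of line l; the line index l varies fastest so
   that neighbouring GPU threads (one per line) touch neighbouring addresses. */
#define IDX(i, l, nl) ((i) * (nl) + (l))

/* Solve nl independent n-by-n pentadiagonal systems
     a[i]*x[i-2] + b[i]*x[i-1] + c[i]*x[i] + d[i]*x[i+1] + e[i]*x[i+2] = r[i]
   by Gaussian elimination without pivoting (assumes diagonal dominance).
   b, c, d and r are overwritten; the solution is returned in r. */
void penta_solve_batch(int n, int nl, const double *a, double *b, double *c,
                       double *d, const double *e, double *r)
{
    assert(n >= 3 && nl >= 1);
    int len = n * nl;

    /* One gang/vector lane per line; each line is solved sequentially. */
    #pragma acc parallel loop copyin(a[0:len], e[0:len]) \
        copy(b[0:len], c[0:len], d[0:len], r[0:len])
    for (int l = 0; l < nl; ++l) {
        /* Forward elimination: use row i to clear column i in rows i+1, i+2. */
        #pragma acc loop seq
        for (int i = 0; i < n - 1; ++i) {
            int k0 = IDX(i, l, nl), k1 = IDX(i + 1, l, nl);
            double m = b[k1] / c[k0];
            c[k1] -= m * d[k0];
            d[k1] -= m * e[k0];
            r[k1] -= m * r[k0];
            if (i + 2 < n) {
                int k2 = IDX(i + 2, l, nl);
                m = a[k2] / c[k0];
                b[k2] -= m * d[k0];
                c[k2] -= m * e[k0];
                r[k2] -= m * r[k0];
            }
        }
        /* Back substitution on the resulting upper-triangular band. */
        int kn1 = IDX(n - 1, l, nl), kn2 = IDX(n - 2, l, nl);
        r[kn1] /= c[kn1];
        r[kn2] = (r[kn2] - d[kn2] * r[kn1]) / c[kn2];
        #pragma acc loop seq
        for (int i = n - 3; i >= 0; --i) {
            int k = IDX(i, l, nl);
            r[k] = (r[k] - d[k] * r[IDX(i + 1, l, nl)]
                         - e[k] * r[IDX(i + 2, l, nl)]) / c[k];
        }
    }
}

/* Build small diagonally dominant test systems with a known solution,
   solve them, and return the largest absolute error. */
double penta_demo_max_error(void)
{
    enum { N = 16, NL = 4, LEN = N * NL };
    double a[LEN], b[LEN], c[LEN], d[LEN], e[LEN], r[LEN], x[LEN];

    for (int i = 0; i < N; ++i)
        for (int l = 0; l < NL; ++l) {
            int k = IDX(i, l, NL);
            a[k] = 1.0; b[k] = -4.0; c[k] = 12.0 + l; d[k] = -4.0; e[k] = 1.0;
            x[k] = 1.0 + 0.1 * i - 0.05 * l;   /* chosen exact solution */
        }
    /* Right-hand side r = A x, dropping terms that fall outside the band. */
    for (int i = 0; i < N; ++i)
        for (int l = 0; l < NL; ++l) {
            int k = IDX(i, l, NL);
            double s = c[k] * x[k];
            if (i >= 2)    s += a[k] * x[IDX(i - 2, l, NL)];
            if (i >= 1)    s += b[k] * x[IDX(i - 1, l, NL)];
            if (i + 1 < N) s += d[k] * x[IDX(i + 1, l, NL)];
            if (i + 2 < N) s += e[k] * x[IDX(i + 2, l, NL)];
            r[k] = s;
        }

    penta_solve_batch(N, NL, a, b, c, d, e, r);

    double max_err = 0.0;
    for (int k = 0; k < LEN; ++k) {
        double err = r[k] - x[k];
        if (err < 0.0) err = -err;
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```

Calling `penta_demo_max_error()` from a small driver should return an error near machine precision. In this sketch the `copy` and `copyin` clauses move every array across the host-device link on each call; keeping data resident on the device between calls (for example with an enclosing OpenACC `data` region) is the usual way to reduce the host-device transfer cost that the abstract highlights.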
Index Terms
- Accelerating the multi-zone scalar pentadiagonal CFD algorithm with OpenACC