ABSTRACT
GPUs are increasingly adopted for large-scale database processing, where data accesses account for the bulk of the computation. If those accesses are irregular, such as hash table accesses or random sampling, GPU performance can suffer. In particular, when scaling such accesses beyond 2 GB of data, performance drops by an order of magnitude. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations - random sampling and hash-based grouping - showing that the slowdown can be dramatically reduced, resulting in a performance increase of up to 13x.
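The abstract does not spell out the mitigation, but a common TLB-conscious strategy for hash-based grouping is to partition the input first so that each pass touches a working set small enough for the TLB to cover. The following is a minimal CPU-side sketch of that idea, not the paper's implementation; the coverage and slot-size constants are assumptions chosen for illustration.

```python
# Hypothetical sketch of TLB-conscious hash-based grouping:
# rather than one pass over a hash table whose pages exceed TLB
# coverage, the input is first partitioned by hash so each
# partition's local table stays within the assumed coverage,
# then each partition is grouped independently.

TLB_COVERAGE_BYTES = 2 * 1024 * 1024  # assumed coverage (entries x page size)
SLOT_BYTES = 16                        # assumed size of one hash-table slot

def group_count(keys, num_partitions=None):
    """Count occurrences per key with a partition-then-group strategy."""
    if num_partitions is None:
        # choose enough partitions that each local table fits in coverage
        table_bytes = len(set(keys)) * SLOT_BYTES
        num_partitions = max(1, -(-table_bytes // TLB_COVERAGE_BYTES))
    # pass 1: scatter keys into partitions by hash
    partitions = [[] for _ in range(num_partitions)]
    for k in keys:
        partitions[hash(k) % num_partitions].append(k)
    # pass 2: group each partition with a small (TLB-friendly) local table
    result = {}
    for part in partitions:
        local = {}
        for k in part:
            local[k] = local.get(k, 0) + 1
        result.update(local)  # partitions hold disjoint keys, no collisions
    return result
```

The key property is that the extra scatter pass is sequential and TLB-friendly, while the random-access grouping pass is confined to a region the TLB can map, which is how partitioning trades bandwidth for locality.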