DOI: 10.1145/3076113.3076115
Research Article

Big data causing big (TLB) problems: taming random memory accesses on the GPU

Published: 14 May 2017

ABSTRACT

GPUs are increasingly adopted for large-scale database processing, where data accesses account for the bulk of the computation. If those accesses are irregular, as in hash table probing or random sampling, GPU performance can suffer. In particular, when such accesses scale beyond 2 GB of data, performance drops by an order of magnitude. This paper analyzes the source of the slowdown through extensive micro-benchmarking and attributes the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory accesses. The approach is applied to two fundamental database operations, random sampling and hash-based grouping, showing that the slowdown can be dramatically reduced, yielding a performance increase of up to 13×.


Published in

DaMoN '17: Proceedings of the 13th International Workshop on Data Management on New Hardware
May 2017, 70 pages
ISBN: 9781450350259
DOI: 10.1145/3076113

    Copyright © 2017 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 80 of 102 submissions, 78%
