ABSTRACT
GPUs are increasingly adopted for large-scale database processing, where data accesses account for the bulk of the computation. If those accesses are irregular, such as hash table accesses or random sampling, GPU performance can suffer. In particular, when scaling such accesses beyond 2 GB of data, performance drops by an order of magnitude. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations - random sampling and hash-based grouping - showing that the slowdown can be dramatically reduced, resulting in a performance increase of up to 13x.
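The abstract does not spell out the mitigation, but a common TLB-conscious strategy for hash-based grouping is to partition the input first so that each pass touches a working set small enough for the TLB to cover. The following is a minimal CPU-side sketch of that idea, not the paper's implementation; the coverage and slot-size constants are assumptions chosen for illustration.

```python
# Hypothetical sketch of TLB-conscious hash-based grouping:
# rather than one pass over a hash table whose pages exceed TLB
# coverage, the input is first partitioned by hash so each
# partition's local table stays within the assumed coverage,
# then each partition is grouped independently.

TLB_COVERAGE_BYTES = 2 * 1024 * 1024  # assumed coverage (entries x page size)
SLOT_BYTES = 16                        # assumed size of one hash-table slot

def group_count(keys, num_partitions=None):
    """Count occurrences per key with a partition-then-group strategy."""
    if num_partitions is None:
        # choose enough partitions that each local table fits in coverage
        table_bytes = len(set(keys)) * SLOT_BYTES
        num_partitions = max(1, -(-table_bytes // TLB_COVERAGE_BYTES))
    # pass 1: scatter keys into partitions by hash
    partitions = [[] for _ in range(num_partitions)]
    for k in keys:
        partitions[hash(k) % num_partitions].append(k)
    # pass 2: group each partition with a small (TLB-friendly) local table
    result = {}
    for part in partitions:
        local = {}
        for k in part:
            local[k] = local.get(k, 0) + 1
        result.update(local)  # partitions hold disjoint keys, no collisions
    return result
```

The key property is that the extra scatter pass is sequential and TLB-friendly, while the random-access grouping pass is confined to a region the TLB can map, which is how partitioning trades bandwidth for locality.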