Coherence-Free Multiview: Enabling Reference-Discerning Data Placement on GPU

Authors:

Xipeng Shen

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 14, Pages 1 - 13

https://doi.org/10.1145/2925426.2926277

Published: 01 June 2016

Abstract

A Graphics Processing Unit (GPU) system is typically equipped with many types of memory (e.g., global, constant, texture, shared, cache). Data placement determines which data are placed in which type of memory, and it is essential for GPU memory performance. Prior data placement optimizations always require a single view of a data object in memory, which limits their effectiveness. In this work, we propose coherence-free multiview, an approach that allows multiple views of a single data object to co-exist in GPU memory during the execution of a GPU kernel. We demonstrate that, under certain conditions, the multiple views can remain incoherent while still facilitating enhanced data placement. We present a theorem and compiler support that ensure coherence-free multiview is used soundly. We further develop reference-discerning data placement, a new way to enhance data placement on GPU. It enables more flexible data placement by using coherence-free multiview to exploit the slack in the coherence requirements of some GPU programs. Experiments on three types of GPU systems show that, with less than 200 KB of space cost, the new data placement technique provides a 1.6X average speedup (up to 4.27X).
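The idea of keeping several views of one data object without synchronizing them can be illustrated with a small host-side simulation. The sketch below is not the paper's theorem or compiler analysis; it assumes one simple sufficient condition chosen for illustration: a reference may be served from a second, never-updated view only if the elements it reads are not written during the kernel. All function names (`kernel_single_view`, `kernel_multiview`, `kernel_unsound`, `demo`) and the toy kernel are hypothetical, and a Python tuple merely stands in for a read-only memory such as texture or constant memory.

```python
def kernel_single_view(a, n):
    """Baseline: every reference to `a` is served from one copy."""
    for tid in range(n):          # sequential stand-in for n GPU threads
        a[n + tid] = 2 * a[tid]   # R1 reads a[tid]; R2 writes a[n + tid]
        a[n + tid] += 1           # R3 reads back the value R2 just wrote
    return a


def kernel_multiview(writable_view, readonly_view, n):
    """Same kernel, but R1 is redirected to a second, never-updated view."""
    for tid in range(n):
        writable_view[n + tid] = 2 * readonly_view[tid]  # R1 -> read-only view
        writable_view[n + tid] += 1                      # R2, R3 -> writable view
    return writable_view


def kernel_unsound(writable_view, readonly_view, n):
    """Counterexample: R3 is also redirected and reads a stale element."""
    for tid in range(n):
        writable_view[n + tid] = 2 * readonly_view[tid]
        writable_view[n + tid] = readonly_view[n + tid] + 1  # stale read
    return writable_view


def demo(n=4):
    original = list(range(1, n + 1)) + [0] * n
    expected = kernel_single_view(list(original), n)
    readonly_view = tuple(original)   # assumed stand-in for texture/constant memory
    result = kernel_multiview(list(original), readonly_view, n)
    broken = kernel_unsound(list(original), readonly_view, n)
    views_differ = list(readonly_view) != result  # the two views are incoherent
    return expected, result, broken, views_differ
```

Calling `demo(4)` shows that the multiview kernel matches the single-view result even though the read-only view is never updated, while redirecting the read-after-write reference breaks the result. This is why each reference, rather than each data object, needs to be examined before it is redirected.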

Index Terms

Coherence-Free Multiview: Enabling Reference-Discerning Data Placement on GPU
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

DOE
Google
NSF
IBM

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Cited By

Sultana T, Allen B, Qasem A, Sarkar V, Kim H (2020). Intelligent Data Placement on Discrete GPU Nodes with Unified Memory. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 139-151. Online publication date: 30-Sep-2020.
https://dl.acm.org/doi/10.1145/3410463.3414651

Hope J, Nag T, Qasem A (2019). Energy-Efficient GPU Graph Processing with On-Demand Page Migration. 2019 Tenth International Green and Sustainable Computing Conference (IGSC), pp. 1-8. Online publication date: Oct-2019.
https://doi.org/10.1109/IGSC48788.2019.8957183

Wen S, Cherkasova L, Lin F, Liu X (2018). ProfDP. Proceedings of the 2018 International Conference on Supercomputing, pp. 263-273. Online publication date: 12-Jun-2018.
https://dl.acm.org/doi/10.1145/3205289.3205320

Chen G, Zhao Y, Shen X, Zhou H (2017). EffiSha. ACM SIGPLAN Notices 52:8, pp. 3-16. Online publication date: 26-Jan-2017.
https://dl.acm.org/doi/10.1145/3155284.3018748

Chen G, Zhao Y, Shen X, Zhou H, Sarkar V, Rauchwerger L (2017). EffiSha. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 3-16. Online publication date: 26-Jan-2017.
https://dl.acm.org/doi/10.1145/3018743.3018748

Huang Y, Li D (2017). Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems. 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 166-177. Online publication date: Sep-2017.
https://doi.org/10.1109/CLUSTER.2017.42
