DOI: 10.1145/2925426.2926277

Coherence-Free Multiview: Enabling Reference-Discerning Data Placement on GPU

Published: 01 June 2016

Abstract

A Graphics Processing Unit (GPU) system is typically equipped with many types of memory (e.g., global, constant, texture, shared, cache). Data placement determines which data are placed on which type of memory, and it is essential to GPU memory performance. Prior data placement optimizations always require a single view of a data object in memory, which limits their effectiveness. In this work, we propose coherence-free multiview, an approach that allows multiple views of a single data object to co-exist in GPU memory during a GPU kernel execution. We demonstrate that, under certain conditions, the multiple views can remain incoherent while still enabling better data placement. We present a theorem and compiler support that ensure coherence-free multiview is used soundly. We further develop reference-discerning data placement, a new way to enhance data placement on GPU: it enables more flexible placements by using coherence-free multiview to exploit the slack in the coherence requirements of some GPU programs. Experiments on three types of GPU systems show that, at a space cost of less than 200KB, the new data placement technique provides a 1.6X average (up to 4.27X) speedup.
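
To make the idea of multiple incoherent views concrete, the following is a minimal CPU-side simulation in Python, not the paper's implementation. It assumes a simple safety condition chosen for illustration only (no element read through the second view is written during the kernel); the abstract does not state the actual theorem, so this condition, the function names, and the toy kernel are all hypothetical. Each loop iteration stands in for one GPU thread, and the read reference and write reference to the same array are bound to different views.

```python
from typing import List, Sequence


def may_stay_incoherent(read_idx: Sequence[int], write_idx: Sequence[int]) -> bool:
    """Assumed condition (illustrative only, not the paper's theorem):
    no element read through the stale view is written during the kernel."""
    return set(read_idx).isdisjoint(write_idx)


def kernel_single_view(data: Sequence[float], src: Sequence[int],
                       dst: Sequence[int]) -> List[float]:
    """Baseline: reads and writes both go through one copy of the object."""
    a = list(data)
    for s, d in zip(src, dst):  # each (s, d) pair stands in for one GPU thread
        a[d] = 2.0 * a[s]
    return a


def kernel_multiview(data: Sequence[float], src: Sequence[int],
                     dst: Sequence[int]) -> List[float]:
    """Two views of the same object, with no synchronization between them."""
    global_view = list(data)     # writable view (stand-in for global memory)
    readonly_view = tuple(data)  # second view (stand-in for constant/texture memory)
    for s, d in zip(src, dst):
        # The read reference uses the read-only view; the write reference uses
        # the global view. The read-only view is never refreshed in the kernel.
        global_view[d] = 2.0 * readonly_view[s]
    return global_view


data = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
src = [3, 0, 2, 1]  # read indices: lower half only
dst = [4, 5, 6, 7]  # write indices: upper half only

safe = may_stay_incoherent(src, dst)
same = kernel_single_view(data, src, dst) == kernel_multiview(data, src, dst)
```

With disjoint read and write index sets, `safe` and `same` are both true: the second view is never updated, yet the kernel output matches the single-view run. When the index sets overlap, the stale view can yield a different result, which is the kind of case a soundness check would have to rule out.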

    Information

    Published In

    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN:9781450343619
    DOI:10.1145/2925426
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. Coherence
    2. Compiler
    3. Data Placement
    4. GPGPU
    5. Memory
    6. Optimizations
    7. Runtime

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '16
    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months): 327
    • Downloads (Last 6 weeks): 25
    Reflects downloads up to 09 Mar 2025

    Citations

    Cited By

    • (2020) Intelligent Data Placement on Discrete GPU Nodes with Unified Memory. Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques, pp. 139-151. DOI: 10.1145/3410463.3414651. Online publication date: 30-Sep-2020
    • (2019) Energy-Efficient GPU Graph Processing with On-Demand Page Migration. 2019 Tenth International Green and Sustainable Computing Conference (IGSC), pp. 1-8. DOI: 10.1109/IGSC48788.2019.8957183. Online publication date: Oct-2019
    • (2018) ProfDP. Proceedings of the 2018 International Conference on Supercomputing, pp. 263-273. DOI: 10.1145/3205289.3205320. Online publication date: 12-Jun-2018
    • (2017) EffiSha. ACM SIGPLAN Notices 52:8, pp. 3-16. DOI: 10.1145/3155284.3018748. Online publication date: 26-Jan-2017
    • (2017) EffiSha. Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 3-16. DOI: 10.1145/3018743.3018748. Online publication date: 26-Jan-2017
    • (2017) Performance Modeling for Optimal Data Placement on GPU with Heterogeneous Memory Systems. 2017 IEEE International Conference on Cluster Computing (CLUSTER), pp. 166-177. DOI: 10.1109/CLUSTER.2017.42. Online publication date: Sep-2017
