skip to main content
10.1145/2925426.2926288acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes

Published: 01 June 2016 Publication History

Abstract

Loop tiling or blocking improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this reduces reuse, it also leads to an often ignored side-effect: breaking the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance.
In this work, we propose a tiling algorithm to leverage prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache as opposed to tiling for higher levels of cache as generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable for prefetching, but also allows for a reduction in the off-chip traffic to memory (and therefore, better scaling with the number of cores). As a result, although we tile for the last level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To achieve this, we propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache in modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives significant improvement over existing tiling algorithms that target higher level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, it achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.

References

[1]
E. Athanasaki, N. Koziris, and P. Tsanakas. A tile size selection analysis for blocked array layouts. In INTERACT-2005. 9th Annual Workshop, pages 70--80.
[2]
A.-H. A. Badawy, A. Aggarwal, D. Yeung, and C.-W. Tseng. Evaluating the impact of memory system performance on software prefetching and locality optimizations. In ICS '01, pages 486--500.
[3]
V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In SC '12, pages 1--11, 2012.
[4]
B. Bao and C. Ding. Defensive loop tiling for shared cache. In CGO '13, pages 1--11.
[5]
C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '04, pages 7--16.
[6]
C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '13, pages 7--16, Juan-les-Pins, France, September 2004.
[7]
J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using phipac: A portable, high-performance, ansi c coding methodology. In ICS '97, pages 340--347.
[8]
U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. In L. Hendren, editor, In CC'08, volume 4959 of Lecture Notes in Computer Science, pages 132--146. 2008.
[9]
J. Chame and S. Moon. A tile selection algorithm for data locality and cache interference. In ICS '99, pages 492--499.
[10]
C. Chen, J. Chame, and M. Hall. Chill: A framework for composing high-level loop transformations. U. of Southern California, Tech. Rep, pages 08--897, 2008.
[11]
S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In PLDI'95, 30(6):279--290.
[12]
K. Cooper and J. Sandoval. Portable Techniques to Find Effective Memory Hierarchy Parameters. Technical report, 2011.
[13]
C. ŢĂpuş, I.-H. Chung, and J. K. Hollingsworth. Active harmony: Towards automated performance tuning. In SC '02, pages 1--11.
[14]
Y. Ding, J. Ansel, K. Veeramachaneni, X. Shen, U. O'Reilly, and S. P. Amarasinghe. Autotuning algorithmic choice for input sensitivity. In In PLDI'15, pages 379--390.
[15]
J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. In TOMS'90, 16(1):1--17.
[16]
Z. Fang, S. Mehta, P.-C. Yew, A. Zhai, J. Greensky, G. Beeraka, and B. Zang. Measuring microarchitectural details of multi- and many-core memory systems through microbench marking. ACMTrans. Archit. Code Optim., 11(4):55:1--55:26, Jan. 2015.
[17]
M. Frigo. A fast fourier transform compiler. In PLDI '99, pages 169--180.
[18]
S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In ICS '97, pages 317--324.
[19]
J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS '12, pages 311--320, New York, NY, USA, 2012. ACM.
[20]
D. Kim, S. S.-w. Liao, P. H. Wang, J. d. Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Girkar, and J. P. Shen. Physical experimentation with prefetching helper threads on intel's hyper-threaded processors. In CGO '04.
[21]
J. Lee, H. Kim, and R. Vuduc. When prefetching works, when it doesn't, and why. In TACO'12, 9(1):2:1--2:29.
[22]
A. W. Lim, S.-W. Liao, and M. S. Lam. Blocking and array contraction across arbitrarily nested loops using affine partitioning. In PPoPP '01, pages 103--112.
[23]
S. Mehta, G. Beeraka, and P.-C. Yew. Tile size selection revisited. In TACO'13, 10(4):35:1--35:27.
[24]
S. Mehta, Z. Fang, A. Zhai, and P.-C. Yew. Multi-stage coordinated prefetching for present-day processors. In ICS '14, pages 73--82.
[25]
S. Moon and R. H. Saavedra. Hyperblocking: A data reorganization method to eliminate cache conflicts in tiled loop nests. Technical report, Conflicts in Tiled Loop Nests, USC-CS-98-671, USC Computer Science, 1998.
[26]
T. C. Mowry,M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In ASPLOS-V'92, pages 62--73.
[27]
L.-N. Pouchet. Polybench Benchmark Suite. Available at http://www\-roc.inria.fr/~pouchet/software/polybench/.
[28]
A. Qasem, K. Kennedy, and J. M. Mellor-Crummey. Automatic tuning of whole applications using direct search and a performance-based transformation system. In SC'06, 36(2):183--196.
[29]
M. Rahman, L.-N. Pouchet, and P. Sadayappan. Neural network assisted tile size selection. In IWAPT '2010.
[30]
J. Reinders. VTune performance analyzer essentials.
[31]
G. Rivera and C.-W. Tseng. A comparison of compiler tiling algorithms. In CC '99, pages 168--182.
[32]
G. Rivera and C.-W. Tseng. Tiling optimizations for 3d scientific computations. In SC'00.
[33]
R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness of unimodular transformations, tiling, and software prefetching. In IPPS '96, pages 39--45.
[34]
J. Shirako, K. Sharma, N. Fauzia, L.-N. Pouchet, J. Ramanujam, P. Sadayappan, and V. Sarkar. Analytical bounds for optimal tile size selection. In CC'12, pages 101--121.
[35]
R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel. Cache accurate time skewing in iterative stencil computations. In ICPP '11, pages 571--581, Sept 2011.
[36]
Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In SPAA '11, pages 117--128, 2011.
[37]
A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In IPDPS '09, pages 1--12.
[38]
F. G. Van Zee and R. A. van de Geijn. Blis: A framework for rapidly instantiating blas functionality. TOMS'15, 41(3):14:1--14:33.
[39]
R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the atlas project. In Parallel Computing, 27:3--35.
[40]
M. Wolfe. More iteration space tiling. In SC '89, pages 655--664.
[41]
Q. Yi and J. Guo. Extensive parameterization and tuning of architecture-sensitive optimizations. In ICCS'11, pages 2156--2165.
[42]
Q. Yi, K. Seymour, H. You, R. W. Vuduc, and D. J. Quinlan. POET: parameterized optimizations for empirical tuning. In IPDPS'07, pages 1--8.

Cited By

View all
  • (2023)Designing Secure Performance Metrics for Last Level Cache2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00069(383-392)Online publication date: May-2023
  • (2021)Accelerating all-electron ab initio simulation of raman spectra for biological systemsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476160(1-15)Online publication date: 14-Nov-2021
  • (2020)Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilersProceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3368826.3377916(317-329)Online publication date: 22-Feb-2020
  • Show More Cited By

Index Terms

  1. TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN:9781450343619
    DOI:10.1145/2925426
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2016

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Loop tiling
    2. Multi-core
    3. Prefetching

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    ICS '16
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)22
    • Downloads (Last 6 weeks)5
    Reflects downloads up to 08 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Designing Secure Performance Metrics for Last Level Cache2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00069(383-392)Online publication date: May-2023
    • (2021)Accelerating all-electron ab initio simulation of raman spectra for biological systemsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476160(1-15)Online publication date: 14-Nov-2021
    • (2020)Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilersProceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3368826.3377916(317-329)Online publication date: 22-Feb-2020
    • (2019)Triton: an intermediate language and compiler for tiled neural network computationsProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages10.1145/3315508.3329973(10-19)Online publication date: 22-Jun-2019
    • (2019)An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral CompilationACM Transactions on Architecture and Code Optimization10.1145/329344915:4(1-23)Online publication date: 8-Jan-2019
    • (2019)Revisiting the Parallel Strategy for DOACROSS LoopsJournal of Computer Science and Technology10.1007/s11390-019-1919-734:2(456-475)Online publication date: 22-Mar-2019
    • (2018)Revisiting Loop Tiling for DatacentersProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205306(328-340)Online publication date: 12-Jun-2018
    • (2018)Loop transformations leveraging hardware prefetchingProceedings of the 2018 International Symposium on Code Generation and Optimization - CGO 201810.1145/3179541.3168823(254-264)Online publication date: 2018
    • (2018)Loop transformations leveraging hardware prefetchingProceedings of the 2018 International Symposium on Code Generation and Optimization10.1145/3168823(254-264)Online publication date: 24-Feb-2018
    • (2018)Optimizing Tiled Matrix-Matrix Product According to Cache Performance Enhancement2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom)10.1109/BDCloud.2018.00098(637-644)Online publication date: Dec-2018
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media