research-article

TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes

Authors:

Nishad Trivedi,

Pen-Chung YewAuthors Info & Claims

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 38, Pages 1 - 12

https://doi.org/10.1145/2925426.2926288

Published: 01 June 2016 Publication History

Abstract

Loop tiling or blocking improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this reduces reuse, it also leads to an often ignored side-effect: breaking the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance.

In this work, we propose a tiling algorithm to leverage prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache as opposed to tiling for higher levels of cache as generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable for prefetching, but also allows for a reduction in the off-chip traffic to memory (and therefore, better scaling with the number of cores). As a result, although we tile for the last level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To achieve this, we propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache in modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives significant improvement over existing tiling algorithms that target higher level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, it achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.

References

[1]

E. Athanasaki, N. Koziris, and P. Tsanakas. A tile size selection analysis for blocked array layouts. In INTERACT-2005. 9th Annual Workshop, pages 70--80.

Digital Library

[2]

A.-H. A. Badawy, A. Aggarwal, D. Yeung, and C.-W. Tseng. Evaluating the impact of memory system performance on software prefetching and locality optimizations. In ICS '01, pages 486--500.

Digital Library

[3]

V. Bandishti, I. Pananilath, and U. Bondhugula. Tiling stencil computations to maximize parallelism. In SC '12, pages 1--11, 2012.

Digital Library

[4]

B. Bao and C. Ding. Defensive loop tiling for shared cache. In CGO '13, pages 1--11.

Digital Library

[5]

C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '04, pages 7--16.

Digital Library

[6]

C. Bastoul. Code generation in the polyhedral model is easier than you think. In PACT '13, pages 7--16, Juan-les-Pins, France, September 2004.

Digital Library

[7]

J. Bilmes, K. Asanovic, C.-W. Chin, and J. Demmel. Optimizing matrix multiply using phipac: A portable, high-performance, ansi c coding methodology. In ICS '97, pages 340--347.

Digital Library

[8]

U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev, and P. Sadayappan. In L. Hendren, editor, In CC'08, volume 4959 of Lecture Notes in Computer Science, pages 132--146. 2008.

Digital Library

[9]

J. Chame and S. Moon. A tile selection algorithm for data locality and cache interference. In ICS '99, pages 492--499.

Digital Library

[10]

C. Chen, J. Chame, and M. Hall. Chill: A framework for composing high-level loop transformations. U. of Southern California, Tech. Rep, pages 08--897, 2008.

[11]

S. Coleman and K. S. McKinley. Tile size selection using cache organization and data layout. In PLDI'95, 30(6):279--290.

Digital Library

[12]

K. Cooper and J. Sandoval. Portable Techniques to Find Effective Memory Hierarchy Parameters. Technical report, 2011.

[13]

C. ŢĂpuş, I.-H. Chung, and J. K. Hollingsworth. Active harmony: Towards automated performance tuning. In SC '02, pages 1--11.

Digital Library

[14]

Y. Ding, J. Ansel, K. Veeramachaneni, X. Shen, U. O'Reilly, and S. P. Amarasinghe. Autotuning algorithmic choice for input sensitivity. In In PLDI'15, pages 379--390.

Digital Library

[15]

J. J. Dongarra, J. Du Croz, S. Hammarling, and I. S. Duff. A set of level 3 basic linear algebra subprograms. In TOMS'90, 16(1):1--17.

Digital Library

[16]

Z. Fang, S. Mehta, P.-C. Yew, A. Zhai, J. Greensky, G. Beeraka, and B. Zang. Measuring microarchitectural details of multi- and many-core memory systems through microbench marking. ACMTrans. Archit. Code Optim., 11(4):55:1--55:26, Jan. 2015.

Digital Library

[17]

M. Frigo. A fast fourier transform compiler. In PLDI '99, pages 169--180.

Digital Library

[18]

S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical representation of cache misses. In ICS '97, pages 317--324.

Digital Library

[19]

J. Holewinski, L.-N. Pouchet, and P. Sadayappan. High-performance code generation for stencil computations on gpu architectures. In ICS '12, pages 311--320, New York, NY, USA, 2012. ACM.

Digital Library

[20]

D. Kim, S. S.-w. Liao, P. H. Wang, J. d. Cuvillo, X. Tian, X. Zou, H. Wang, D. Yeung, M. Girkar, and J. P. Shen. Physical experimentation with prefetching helper threads on intel's hyper-threaded processors. In CGO '04.

Digital Library

[21]

J. Lee, H. Kim, and R. Vuduc. When prefetching works, when it doesn't, and why. In TACO'12, 9(1):2:1--2:29.

Digital Library

[22]

A. W. Lim, S.-W. Liao, and M. S. Lam. Blocking and array contraction across arbitrarily nested loops using affine partitioning. In PPoPP '01, pages 103--112.

Digital Library

[23]

S. Mehta, G. Beeraka, and P.-C. Yew. Tile size selection revisited. In TACO'13, 10(4):35:1--35:27.

Digital Library

[24]

S. Mehta, Z. Fang, A. Zhai, and P.-C. Yew. Multi-stage coordinated prefetching for present-day processors. In ICS '14, pages 73--82.

Digital Library

[25]

S. Moon and R. H. Saavedra. Hyperblocking: A data reorganization method to eliminate cache conflicts in tiled loop nests. Technical report, Conflicts in Tiled Loop Nests, USC-CS-98-671, USC Computer Science, 1998.

[26]

T. C. Mowry,M. S. Lam, and A. Gupta. Design and evaluation of a compiler algorithm for prefetching. In ASPLOS-V'92, pages 62--73.

Digital Library

[27]

L.-N. Pouchet. Polybench Benchmark Suite. Available at http://www\-roc.inria.fr/~pouchet/software/polybench/.

[28]

A. Qasem, K. Kennedy, and J. M. Mellor-Crummey. Automatic tuning of whole applications using direct search and a performance-based transformation system. In SC'06, 36(2):183--196.

Digital Library

[29]

M. Rahman, L.-N. Pouchet, and P. Sadayappan. Neural network assisted tile size selection. In IWAPT '2010.

[30]

J. Reinders. VTune performance analyzer essentials.

[31]

G. Rivera and C.-W. Tseng. A comparison of compiler tiling algorithms. In CC '99, pages 168--182.

Digital Library

[32]

G. Rivera and C.-W. Tseng. Tiling optimizations for 3d scientific computations. In SC'00.

Digital Library

[33]

R. Saavedra, W. Mao, D. Park, J. Chame, and S. Moon. The combined effectiveness of unimodular transformations, tiling, and software prefetching. In IPPS '96, pages 39--45.

Digital Library

[34]

J. Shirako, K. Sharma, N. Fauzia, L.-N. Pouchet, J. Ramanujam, P. Sadayappan, and V. Sarkar. Analytical bounds for optimal tile size selection. In CC'12, pages 101--121.

Digital Library

[35]

R. Strzodka, M. Shaheen, D. Pajak, and H.-P. Seidel. Cache accurate time skewing in iterative stencil computations. In ICPP '11, pages 571--581, Sept 2011.

Digital Library

[36]

Y. Tang, R. A. Chowdhury, B. C. Kuszmaul, C.-K. Luk, and C. E. Leiserson. The pochoir stencil compiler. In SPAA '11, pages 117--128, 2011.

Digital Library

[37]

A. Tiwari, C. Chen, J. Chame, M. Hall, and J. K. Hollingsworth. A scalable auto-tuning framework for compiler optimization. In IPDPS '09, pages 1--12.

Digital Library

[38]

F. G. Van Zee and R. A. van de Geijn. Blis: A framework for rapidly instantiating blas functionality. TOMS'15, 41(3):14:1--14:33.

Digital Library

[39]

R. C. Whaley, A. Petitet, and J. J. Dongarra. Automated empirical optimizations of software and the atlas project. In Parallel Computing, 27:3--35.

Digital Library

[40]

M. Wolfe. More iteration space tiling. In SC '89, pages 655--664.

Digital Library

[41]

Q. Yi and J. Guo. Extensive parameterization and tuning of architecture-sensitive optimizations. In ICCS'11, pages 2156--2165.

[42]

Q. Yi, K. Seymour, H. You, R. W. Vuduc, and D. J. Quinlan. POET: parameterized optimizations for empirical tuning. In IPDPS'07, pages 1--8.

Cited By

Roy PEshete BSu P(2023)Designing Secure Performance Metrics for Last Level Cache2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00069(383-392)Online publication date: May-2023
https://doi.org/10.1109/IPDPSW59300.2023.00069
Shang HLi FZhang YLiu YZhang LWu MWu YWei DCui HLiu XWang FYe YGao YNi SChen XChen Dde Supinski BHall MGamblin T(2021)Accelerating all-electron ab initio simulation of raman spectra for biological systemsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476160(1-15)Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476160
Javanmard MAhmad ZKong MPouchet LChowdhury RHarrison RMars JTang LXue JWu P(2020)Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilersProceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3368826.3377916(317-329)Online publication date: 22-Feb-2020
https://dl.acm.org/doi/10.1145/3368826.3377916
Show More Cited By

Index Terms

TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Coordinated control of multiple prefetchers in multi-core systems
MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture

Aggressive prefetching is very beneficial for memory latency tolerance of many applications. However, it faces significant challenges in multi-core systems. Prefetchers of different cores on a chip multiprocessor (CMP) can cause significant interference ...
Defensive loop tiling for multi-core processor
MSPC '12: Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness

Loop tiling is a compiler transformation that tailors an application's working set to fit in a cache hierarchy. On today's multicore processors, part of the hierarchy, especially the last level cache (LLC) is shared. In this paper, we show that cache ...
Criticality aware tiered cache hierarchy: a fundamental relook at multi-level cache hierarchies
ISCA '18: Proceedings of the 45th Annual International Symposium on Computer Architecture

On-die caches are a popular method to help hide the main memory latency. However, it is difficult to build large caches without substantially increasing their access latency, which in turn hurts performance. To overcome this difficulty, on-die caches ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

12
Total Citations
View Citations
323
Total Downloads

Downloads (Last 12 months)22
Downloads (Last 6 weeks)5

Reflects downloads up to 08 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Roy PEshete BSu P(2023)Designing Secure Performance Metrics for Last Level Cache2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)10.1109/IPDPSW59300.2023.00069(383-392)Online publication date: May-2023
https://doi.org/10.1109/IPDPSW59300.2023.00069
Shang HLi FZhang YLiu YZhang LWu MWu YWei DCui HLiu XWang FYe YGao YNi SChen XChen Dde Supinski BHall MGamblin T(2021)Accelerating all-electron ab initio simulation of raman spectra for biological systemsProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3458817.3476160(1-15)Online publication date: 14-Nov-2021
https://dl.acm.org/doi/10.1145/3458817.3476160
Javanmard MAhmad ZKong MPouchet LChowdhury RHarrison RMars JTang LXue JWu P(2020)Deriving parametric multi-way recursive divide-and-conquer dynamic programming algorithms using polyhedral compilersProceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization10.1145/3368826.3377916(317-329)Online publication date: 22-Feb-2020
https://dl.acm.org/doi/10.1145/3368826.3377916
Tillet PKung HCox DMattson TMuzahid ASolar-Lezama A(2019)Triton: an intermediate language and compiler for tiled neural network computationsProceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages10.1145/3315508.3329973(10-19)Online publication date: 22-Jun-2019
https://dl.acm.org/doi/10.1145/3315508.3329973
Sato YYuki TEndo T(2019)An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral CompilationACM Transactions on Architecture and Code Optimization10.1145/329344915:4(1-23)Online publication date: 8-Jan-2019
https://dl.acm.org/doi/10.1145/3293449
Liu SCui YZou NZhu WZhang DWu W(2019)Revisiting the Parallel Strategy for DOACROSS LoopsJournal of Computer Science and Technology10.1007/s11390-019-1919-734:2(456-475)Online publication date: 22-Mar-2019
https://doi.org/10.1007/s11390-019-1919-7
Zhao JCui HZhang YXue JFeng X(2018)Revisiting Loop Tiling for DatacentersProceedings of the 2018 International Conference on Supercomputing10.1145/3205289.3205306(328-340)Online publication date: 12-Jun-2018
https://dl.acm.org/doi/10.1145/3205289.3205306
Sioutas SStuijk SCorporaal HBasten TSomers L(2018)Loop transformations leveraging hardware prefetchingProceedings of the 2018 International Symposium on Code Generation and Optimization - CGO 201810.1145/3179541.3168823(254-264)Online publication date: 2018
https://doi.org/10.1145/3179541.3168823
Sioutas SStuijk SCorporaal HBasten TSomers LKnoop JSchordan MJohnson TO'Boyle M(2018)Loop transformations leveraging hardware prefetchingProceedings of the 2018 International Symposium on Code Generation and Optimization10.1145/3168823(254-264)Online publication date: 24-Feb-2018
https://dl.acm.org/doi/10.1145/3168823
Hammami ESlama Y(2018)Optimizing Tiled Matrix-Matrix Product According to Cache Performance Enhancement2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom)10.1109/BDCloud.2018.00098(637-644)Online publication date: Dec-2018
https://doi.org/10.1109/BDCloud.2018.00098
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten