A tuning framework for software-managed memory hierarchies

Authors:

William J. Dally

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

Pages 280 - 291

https://doi.org/10.1145/1454115.1454155

Published: 25 October 2008

Abstract

Achieving good performance on a modern machine with a multi-level memory hierarchy, and in particular on a machine with software-managed memories, requires precise tuning of programs to the machine's particular characteristics. A large program on a multi-level machine can easily expose tens or hundreds of interdependent parameters that require tuning, and manually searching the resulting large, non-linear space of program parameters is a tedious process of trial and error. In this paper we present a general framework for automatically tuning general applications to machines with software-managed memory hierarchies. We evaluate our framework by measuring the performance of benchmarks that are tuned for a range of machines with different memory hierarchy configurations: a cluster of Intel P4 Xeon processors, a single Cell processor, and a cluster of Sony PlayStation 3 consoles.
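To make the notion of a space of interdependent parameters concrete, the following is a minimal sketch and not the framework described in the paper. It assumes a hypothetical two-level machine, hypothetical tile-size candidates, a capacity rule modelled on a blocked matrix multiply, and a synthetic cost function standing in for an actual timed run. All names (LEVEL_CAPACITY, CANDIDATES, fits, synthetic_cost, exhaustive_search) are illustrative assumptions.

```python
from itertools import product

# Hypothetical machine description: capacity (in elements) of each
# software-managed memory level, outermost first. Not taken from the paper.
LEVEL_CAPACITY = [1 << 20, 1 << 14]

# Hypothetical candidate tile edge lengths, shared by every level.
CANDIDATES = [16, 32, 64, 128, 256, 512]


def fits(tiles):
    """Return True if the configuration is legal on the assumed machine.

    Assumed rules: each level holds three square tiles (as in a blocked
    matrix multiply), and tiles never grow going down the hierarchy.
    """
    nested = all(outer >= inner for outer, inner in zip(tiles, tiles[1:]))
    in_capacity = all(3 * t * t <= cap for t, cap in zip(tiles, LEVEL_CAPACITY))
    return nested and in_capacity


def synthetic_cost(tiles):
    """Stand-in for compiling, running, and timing the program.

    Smaller tiles mean more transfers (modelled as 1/t per level). The
    penalty term couples the two parameters so they cannot be tuned
    independently. Purely illustrative.
    """
    transfer = sum(1.0 / t for t in tiles)
    imbalance = abs(tiles[0] / tiles[1] - 8) * 0.001
    return transfer + imbalance


def exhaustive_search(measure):
    """Evaluate every legal configuration and return the cheapest one."""
    space = product(CANDIDATES, repeat=len(LEVEL_CAPACITY))
    legal = [config for config in space if fits(config)]
    return min(legal, key=measure)


best = exhaustive_search(synthetic_cost)
print("best tile sizes per level:", best)
```

Exhaustive enumeration is only workable here because the toy space has 36 points. With tens or hundreds of parameters the space grows multiplicatively, which is why a guided search is needed in practice.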

Published In

PACT '08: Proceedings of the 17th international conference on Parallel architectures and compilation techniques

October 2008

328 pages

ISBN: 9781605582825

DOI: 10.1145/1454115

General Chair: Andreas Moshovos (University of Toronto, Canada)

Program Chairs: David Tarditi (Microsoft, USA), Kunle Olukotun (Stanford University, USA)

Copyright © 2008 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

PACT '08

PACT '08: International Conference on Parallel Architectures and Compilation Techniques

October 25 - 29, 2008

Toronto, Ontario, Canada

Acceptance Rates

Overall Acceptance Rate 121 of 471 submissions, 26%
