skip to main content
article

Analyzing data reuse for cache reconfiguration

Published: 01 November 2005 Publication History

Abstract

Classical compiler optimizations assume a fixed cache architecture and modify the program to take best advantage of it. In some cases, this may not be the best strategy because each nest might work best with a different cache configuration and transforming a nest for a given fixed cache configuration may not be possible due to data and control dependences. Working with a fixed cache configuration can also increase energy consumption in loops where the best required configuration is smaller than the default (fixed) one. In this paper, we take an alternate approach and modify the cache configuration for each nest, depending on the access pattern exhibited by the nest. We call this technique compiler-directed cache polymorphism (CDCP). More specifically, in this paper, we make the following contributions. First, we present an approach for analyzing data reuse properties of loop nests. Second, we give algorithms to simulate the footprints of array references in their reuse space. Third, based on our reuse analysis, we present an optimization algorithm to compute the cache configurations for each loop nest. Our experimental results show that CDCP is very effective in finding the near-optimal data cache configurations for different nests in array-intensive applications.

References

[1]
Ahmed, N., Mateev, N., and Pingali, K. 2000. Synthesizing transformations for locality enhancement of imperfectly-nested loop nests. In Proceedings of the 2000 International Conference on Supercomputing, Santa Fe, New Mexico. 141--152.
[2]
Albonesi, D. H. 1999. Selective cache ways: On-demand cache resource allocation. In Proc. of the 32nd Annual International Conference on Microarchitecture.
[3]
Bacon, D. F., Graham, S. L., and Sharp, O. J. 1994. Compiler transformations for high-performance computing. ACM Computing Surveys 26, 4, 345--420.
[4]
Burger, D., Kagi, A., and Hrishikesh, M. S. 2000. Memory hierarchy extensions to simplescalar 3.0. Tech. Rep. TR99-25, Department of Computer Sciences, The University of Texas at Austin, Austin, TX.
[5]
Chame, J. 1997. Compiler analysis of cache interference and its applications to compiler optimizations. Ph.D. thesis, Dept. of Computer Engineering, University of Southern California, Los Angeles, CA.
[6]
Cmelik, B. and Keppel, D. 1994. Shade: a fast instruction-set simulator for execution profiling. In Proc. of the 1994 ACM SIGMETRICES Conf. on the Measurement and Modeling of Computer Systems. 128--137.
[7]
Gannon, D., Jalby, W., and Gallivan, K. 1988. Strategies for cache and local memory management by global program transformation. Journal of Parallel and Distributed Computing 5, 5 (Oct.), 587--616.
[8]
Ghosh, S., Martonosi, M., and Malik, S. 1997. Cache miss equations: An analytical representation of cache misses. In Proc. of the 11th International Conference on Supercomputing (ICS-97).
[9]
Ghosh, S., Martonosi, M., and Malik, S. 1998. Precise miss analysis for program transformations with caches of arbitrary associativity. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA. 228--239.
[10]
Givargis, T., Henkel, J., and Vahid, F. 1999. Interface and cache power exploration for core-based embedded systems. In Proceedings of International Conference on Computer Aided Design (ICCAD). 270--273.
[11]
Grunwald, D., Zorn, B. G., and Henderson, R. 1993. Improving the cache locality of memory allocation. In Proceedings of the ACM SIGPLAN'93 Conference on Programming Language Design and Implementation (PLDI), Albuquerque, New Mexico. 177--186.
[12]
Kadayif, I., Kandemir, M., Vijaykrishnan, N., Irwin, M. J., and Ramanujam, J. 2001. Morphable cache architectures: potential benefits. In ACM Workshop on Languages, Compilers, and Tools for Embedded Systems (LCTES'01).
[13]
Manjikian, N. and Abdelrahman, T. S. 1995. Fusion of loops for parallelism and locality. In Proceedings of the 24th International Conference on Parallel Processing (ICPP'95). Oconomowoc, Wisconsin. II:19--28.
[14]
McKinley, K. S., Carr, S., and Tseng, C.-W. 1996. Improving data locality with loop transformations. ACM Transactions on Programming Lanaguages and Systems 18, 4 (July), 424--453.
[15]
Ranganathan, P., Adve, S., and Jouppi, N. P. 2000. Reconfigurable caches and their application to media processing. In Proc. of the 27th Annual International Symposium on Computer Architecture. 214--224.
[16]
Reinman, G. and Jouppi, N. 1999. An integrated cache timing and power model. Cacti 2.0 technical report, COMPAQ Western Research Lab.
[17]
Rivera, G. and Tseng, C.-W. 1998. Eliminating conflict misses for high performance architectures. In Proceedings of the 1998 International Conference on Supercomputing, Melbourne, Australia. 353--360.
[18]
Simunic, T., Micheli, G. D., and Benini, L. 1999. Energy-efficient design of battery-powered embedded systems. In Proceedings of International Symposium on Low Power Electronics and Design. 212--217.
[19]
Song, Y. and Li, Z. 1999. New tiling techniques to improve cache temporal locality. In Proceedings of the SIGPLAN '99 Conference on Programming Language Design and Implementation, Atlanta, GA.
[20]
Stanford Compiler Group. 1994. The SUIF Library, version 1.0 edition. Stanford Compiler Group.
[21]
Temam, O., Fricker, C., and Jalby, W. 1994. Cache interference phenomena. In Proc. of ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems.
[22]
Tiwari, V., Malik, S., Wolfe, A., and Lee, M. 1996. Instruction level power analysis and optimization of software. Journal of VLSI Signal Processing 13, 2, 1--18.
[23]
Wolf, M. and Lam, M. 1991. A data locality optimizing algorithm. In Proc. of SIGPLAN'91 conf. Programming Language Design and Implementation. 30--44.
[24]
Yi, Q., Adve, V., and Kennedy, K. 2000. Transforming loops to recursion for multi-level memory hierarchies. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, Vancouver, Canada.

Cited By

View all
  • (2014)Using an adaptive HPC runtime system to reconfigure the cache hierarchyProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SC.2014.90(1047-1058)Online publication date: 16-Nov-2014
  • (2014)Loop scheduling with memory access reduction subject to register constraints for DSP applicationsSoftware—Practice & Experience10.1002/spe.218644:8(999-1026)Online publication date: 1-Aug-2014
  • (2013)An Optimized GP-GPU Warp Scheduling Algorithm for Sparse Matrix-Vector MultiplicationProceedings of the 2013 IEEE Eighth International Conference on Networking, Architecture and Storage10.1109/NAS.2013.35(222-231)Online publication date: 17-Jul-2013
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Embedded Computing Systems
ACM Transactions on Embedded Computing Systems  Volume 4, Issue 4
November 2005
259 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/1113830
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Journal Family

Publication History

Published: 01 November 2005
Published in TECS Volume 4, Issue 4

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Embedded software
  2. cache locality
  3. cache polymorphism
  4. compilers
  5. data reuse
  6. energy consumption

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)1
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2014)Using an adaptive HPC runtime system to reconfigure the cache hierarchyProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SC.2014.90(1047-1058)Online publication date: 16-Nov-2014
  • (2014)Loop scheduling with memory access reduction subject to register constraints for DSP applicationsSoftware—Practice & Experience10.1002/spe.218644:8(999-1026)Online publication date: 1-Aug-2014
  • (2013)An Optimized GP-GPU Warp Scheduling Algorithm for Sparse Matrix-Vector MultiplicationProceedings of the 2013 IEEE Eighth International Conference on Networking, Architecture and Storage10.1109/NAS.2013.35(222-231)Online publication date: 17-Jul-2013
  • (2011)Dynamically Adaptive I-Cache Partitioning for Energy-Efficient Embedded MultitaskingIEEE Transactions on Very Large Scale Integration (VLSI) Systems10.1109/TVLSI.2010.206699519:11(2067-2080)Online publication date: 1-Nov-2011
  • (2010)Cache partitioning for energy-efficient and interference-free embedded multitaskingACM Transactions on Embedded Computing Systems10.1145/1698772.16987749:3(1-35)Online publication date: 5-Mar-2010
  • (2010)Studying Filter Cache Bypassing on Embedded SystemsProceedings of the 2010 10th IEEE International Conference on Computer and Information Technology10.1109/CIT.2010.296(1679-1686)Online publication date: 29-Jun-2010
  • (2009)Instruction Hints for Super Efficient Data CachesProceedings of the 9th International Conference on Computational Science10.1007/978-3-642-01973-9_76(677-685)Online publication date: 25-May-2009
  • (2008)A compiler approach to managing storage and memory bandwidth in configurable architecturesACM Transactions on Design Automation of Electronic Systems10.1145/1391962.139196913:4(1-26)Online publication date: 3-Oct-2008
  • (2007)Eliminating inter-process cache interference through cache reconfigurability for real-time and low-power embedded multi-tasking systemsProceedings of the 2007 international conference on Compilers, architecture, and synthesis for embedded systems10.1145/1289881.1289917(198-207)Online publication date: 30-Sep-2007

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media