research-article

A survey on cache tuning from a power/energy perspective

Authors:
Wei Zang

University of Florida, Gainesville, FL

Ann Gordon-Ross

University of Florida, Gainesville, FL


ACM Computing Surveys, Volume 45, Issue 3, Article No. 32, pp. 1–49. https://doi.org/10.1145/2480741.2480749

Published: 03 July 2013


Abstract

Low power and/or energy consumption is a requirement not only in embedded systems that run on batteries or have limited cooling capabilities, but also in desktops and mainframes, where chips require costly cooling techniques. Since the cache subsystem is typically the most power/energy-consuming subsystem, caches are good candidates for power/energy optimizations, and cache tuning techniques are therefore widely researched. This survey focuses on state-of-the-art offline static and online dynamic cache tuning techniques and summarizes the techniques' attributes, major challenges, and potential research trends to inspire novel ideas and future research avenues.

References

Albonesi, D. H. 1999. Selective cache way: On-demand cache resource allocation. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE, Washington, DC, 248--259. Google ScholarDigital Library
Ammons, G., Ball, T., and Larus, J. R. 1997. Exploiting hardware performance counters with flow and context sensitive profiling, In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, NY, 85--96. Google ScholarDigital Library
Anderson, J. M., Berc, L. M., Dean, J., Ghemawat, S., Henzinger M. R., Leung S. A., Sites, R. L., Vandevoorde, M. T., Waldspurger C. A., and Weihl W. E. 1997. Continuous profiling: Where have all the cycles gone&quest; ACM Trans. Comput. Syst. 15, 4, 357--390. Google ScholarDigital Library
Austin, T., Larson, E., and Ernst, D. 2002. SimpleScalar: An infrastructure for comput. system modeling. IEEE Comput. 35, 2, 59--67. Google ScholarDigital Library
Awasthi, M., Sudan, K., Balasubramonian, R., and Carter, J. 2009. Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. In Proceedings of Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 250--261.Google Scholar
Balasubramonian, R., Albonesi, D., Buyuktosunoglu, A. and Dwarkadas, S. 2000. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, ACM, New York, NY, 245--257. Google ScholarDigital Library
Balasubramonian, R., Jouppi, N. P., and Muralimanohar, N. 2011. Multi-core cache hierarchies. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. Google ScholarDigital Library
Beckmann, B., Marty, M., and Wood, D. 2006. ASR: Adaptive selective replication for CMP caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Los Alamitos, CA, 443--454. Google ScholarDigital Library
Bedichek, R. 2004. SimNow: Fast platform simulation purely in software. In Proceedings of the Symposium on High Performance Chips (HOT CHIPS).Google Scholar
Bellard, F. 2005. QEMU, a fast and portable dynamic translator, USENIX' 05 Technical Program. Google ScholarDigital Library
Benitez, D., Moure, J. C., Rexachs, D. I., and Luque E. 2006. Evaluation of the field-prorammable cache: Performance and energy consumption, In Proceedings of the 3rd Conference on Computing Frontiers, ACM, New York, NY, 361--372. Google ScholarDigital Library
Biesbrouck, M. V., Sherwood, T., and Calder. B. 2004. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Washington, DC, 45--56. Google ScholarDigital Library
Binkert, N. L., Dreslinski, R. G., Hsu, L. R., Lim, K. T., Saidi, A. G., and Reinhardt, S. K. 2006. The M5 simulator: Modeling networked systems. IEEE Micro. 26, 4, 52-60. Google ScholarDigital Library
Bohr, M. T., Chau, R. S., Ghani, T., and Mistry, K. 2007. The high-k solution, IEEE Spectrum. Google ScholarDigital Library
Brehob, M. and Enbody, R. J. 1996. An analytical model of locality and caching. Tech. rep. Michigan State University, East Lansing, MI.Google Scholar
Brooks, D. M., Tiwari, V., and Martonosi, M. 2000. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of 27th International Symposium on Computer Architecture. IEEE, Washington, DC, 83--94. Google ScholarDigital Library
Brooks, D. M., Bose, P., Srinivasan, V., Gschwind, M., Emma, P., and Rosenfield, M. 2003. New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors. IBM J. Res. Develop. 47, 5--6, 653--670. Google ScholarDigital Library
Chandra, D., Guo, F., Kim, S., and Solihin, Y. 2005. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. IEEE, Washington, DC, 340--351. Google ScholarDigital Library
Chang, J. and Sohi, G. 2006. Co-operative caching for chip multiprocessors. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA). IEEE, Washington, DC, 264--276. Google ScholarDigital Library
Chatterjee, S., Parker, E., Hanlon, P. J., and Lebeck, A. R. 2001. Exact analysis of the cache behavior of nested loops. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM, New York, NY, 286--297. Google ScholarDigital Library
Chatterjee, B., Sachdev, M., Hsu, S., Krishnamurthy, R., and Borkar, S. 2003. Effectiveness and scaling trends of leakage control techniques for sub-130 nm CMOS technologies. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED). IEEE, Washington, DC, 122--127. Google ScholarDigital Library
Chen, C. F., Yang, S., Falsafi, B., and Moshovos, A. 2004. Accurate and complexity-effective spatial pattern prediction. In Proceedings of the 10th International Symposium on High Performance Computer Architecture. IEEE, Washington, DC, 276. Google ScholarDigital Library
Chen, J., Dubois, M., and Stenstrom, P. 2007. SimWattch: Integrating complete-system and user-level performance and power simulators, IEEE Micro, 27, 4, 34--48. Google ScholarDigital Library
Chen, X. E. and Aamodt, T. M. 2009. A first-order fine-grained multithreaded throughput model. In Proceedings of the IEEE 15th International Symposium on High Performance Computer Architecture. IEEE, Washington, DC, 329--340.Google Scholar
Chen, J., Annavaram, M., and Dubois, M. 2009. SlackSim: A platform for parallel simulation of CMPs on CMPs. ACM SIGARCH Comput. Architect. News. 37, 2, 20--29. Google ScholarDigital Library
Chidester, M. C. and George, A. D. 2002. Parallel simulation of chip-multiprocessor architectures. ACM Trans. Model. Comput. Simul. (TOMACS) 12, 3, 176--200. Google ScholarDigital Library
Chiou, D., Chiouy, D., Rudolph, L., Devadas, S., and Ang, B. S. 2000. Dynamic cache partitioning via columnization. Computation Structures Group Memo 430. Massachusetts Institute of Technology.Google Scholar
Cho, S. and Jin, L. 2006. Managing distributed, shared L2 caches through OS-Level page allocation. In Proceedings of the ACM/IEEE International Symposium on Microarchitectures (MICRO). IEEE, Washington, DC, 455--468 Google ScholarDigital Library
Cmelik, B. and Keppel, D. 1994. SHADE: A fast instruction-set simulator for execution profiling. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, 128--137. Google ScholarDigital Library
Conte, T. M., Hirsch, M. A., and Hwu, W. W. 1998. Combining trace sampling with single pass methods for efficient cache simulation. IEEE Trans. Comput. 47, 6, 714--720. Google ScholarDigital Library
Dean, J., Hicks, J. E., Waldspurger, C. A., Weihl, W. E., and Chrysos, G. 1997. ProfileMe: Hardware support for instruction-level profiling in out-of-order processors. In Proceedings of the 30th Anual ACM/IEEE International Symposium on Microarchitecture. IEEE, Washington, DC, 292--302. Google ScholarDigital Library
Dhodapkar, A. S. and Smith, J. E. 2002. Managing multi-configuration hardware via dynamic working set analysis. In Proceedings of the 29th Annual International Symposium on Computer Architecture. IEEE, Washington, DC, 233--244. Google ScholarDigital Library
Dhodapkar, A. S. and Smith, J. E. 2003. Comparing program phase detection techniques. In Proceedings of the International Symposium on Microarchitecture. IEEE, Washington, DC, 217. Google ScholarDigital Library
Díaz, J., Hidalgo, J. I., Fernández, F., Garnica, O., and López, S. 2009. Improving SMT performance: An application of genetic algorithms to configure resizable caches. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference. ACM, New York, NY, 2029--2034. Google ScholarDigital Library
Ding, C. and Zhong, Y. 2003. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, NY, 245--257. Google ScholarDigital Library
Ding, C. and Chilimbi, T. 2009. A composable model for analyzing locality of multi-threaded programs. Tech. rep. MSR-TR-2009-107, Microsoft.Google Scholar
Dropsho, S., Buyuktosunoglu, A., Balasubramonian, R., Albonesi, D. H., Dwarkadas, S., Semeraro, G., Magklis, G., and Scott, M. L. 2002. Integrating adaptive on-chip storage structures for reduced dynamic power. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 141--152. Google ScholarDigital Library
Dropsho, S., Kursun, V., Albonesi, D. H., Dwarkadas, S., and Friedman, E. G. 2002. Managing static leakage energy in microprocessor functional units. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'35). IEEE, Los Alamitos, CA, 321--332. Google ScholarDigital Library
Duesterwald, E., Cascaval, C., and Dwarkadas, S. 2003. Characterizing and predicting program behavior and its variability. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 220--231. Google ScholarDigital Library
Dybdahl H. and Stenstrom, P. 2007. An adaptive shared/private nuca cache partitioning scheme for chip multiprocessors. In Proceedings of the Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 2--12. Google ScholarDigital Library
Edler, J. and Hill, M. D. 1998. Dinero IV trace-driven uniprocessor cache simulator. http://www.cs.wisc.edu/&sim;markhill/DineroIV.Google Scholar
Edmondon, J., Rubinfeld, P. I., Bannon P. J., Benschneider, B. J., Bernstein, D., Castelino, R. W., Cooper, E. M., Dever, D. E., Donchin, D. R., Fischer, T. C., Jain, A. K., Mehta, S., Meyer, J. E., Preston, R. P., Rajagopalan, V., Somanathan, C., Taylor, S. A., and Wolrich, G. M. 1995. Internal organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC microprocessor. Digi. Tech. J. Special 10th Anniversary Issue, 7, 1, 119--135. Google ScholarDigital Library
Eeckhout, L., Nussbaum, S., Smith, J. E., and Bosschere, K. D. 2003. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro. 23, 5, 26--38. Google ScholarDigital Library
Eeckhout, L. 2010. Computer architecture performance evaluation methods. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. Google ScholarDigital Library
Eklov, D., Black-Schaffer, D., and Hagersten, E. 2011. Fast modeling of shared cache in multicore systems. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM New York, NY, 147--157. Google ScholarDigital Library
Falcón, A., Faraboschi, P., and Ortega. D. 2008. An adaptive synchronization technique for parallel simulation of networked clusters. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 22--31. Google ScholarDigital Library
Fang, C., Carr, S., Onder, S., and Wang, Z. 2004. Reuse-distance-based miss-rate prediction on a per instruction basis. In Proceedings of the Workshop on Memory System Performance. ACM, New York, NY, 60--68. Google ScholarDigital Library
Flautner, K., Kim, N. S., Matin, S., Blaauw, D., and Mudge, T. 2002. Drowsy caches: Simple techniques for reducing leakage power, In Proceedings of the 29th Annual International Symposium on Computer Architecture. ACM, New York, NY, 148--157. Google ScholarDigital Library
Genbrugge, D., Eeckhout, L., and Bosschere K. D. 2006. Accurate memory data flow modeling in statistical simulation. In Proceedings of the 20th Annual International Conference of Supercomputing. ACM, New York, NY. Google ScholarDigital Library
Ghosh, S., Martonosi, M., and Malik, S. 1999. Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Program. Lang. Syst. 21, 4, 703--746. Google ScholarDigital Library
Ghosh, A., and Givargis, T. 2004. Cache optimization for embedded processor cores: An analytical approach. ACM Trans. Design Autom. Electron. Syst. 9, 4, 419--440. Google ScholarDigital Library
Gluhovsky, I. and O'Krafka, B. 2005. Comprehensive multiprocessor cache miss rate generation using multivariate models. ACM Trans. Comput. Syst. 23, 2. 111--145. Google ScholarDigital Library
Goldschmidt, S. and Hennessey, J. 1992. The accuracy of trace-driven simulations of multiprocessors. Tech rep. CSL-TR-92-546, Stanford University. Google ScholarDigital Library
Gordon-Ross, A., Vahid, F., and Dutt, N. 2004. Automatic tuning of two level caches to embedded applications. In Proceedings of the Conference on Design, Automation and Test in Europe. IEEE, Washington, DC. Google ScholarDigital Library
Gordon-Ross, A. and Vahid, F. 2007. A self-tuning configurable cache. In Proceedings of the 44th Anual Design Automation Conference. ACM, New York, NY, 234--237. Google ScholarDigital Library
Gordon-Ross, A., Viana, P., Vahid, F., Najjar, W., and Barros, E. 2007. A one-shot configurable-cache tuner for improved energy and performance. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, San Jose, CA, 755--760. Google ScholarDigital Library
Gordon-Ross, A., Lau, J., and Calder, B. 2008. Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, New York, NY, 379--382. Google ScholarDigital Library
Gordon-Ross, A., Vahid, F., and Dutt, N. 2009. Fast configurable-cache tuning with a unified second-level cache. IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 17, 1, 80--91. Google ScholarDigital Library
Hamerly. G., Perelman, E., Lau, J., and Calder, B. 2005. SimPoint 3.0: Faster and more flexible program analysis. J. Instruct.-Level Parall. 7, 1--28.Google Scholar
Hanson, H., Hrishikesh, M. S., Agarwal, V., Keckler, S. W., and Burger, D. 2003. Static energy reduction techniques for microprocessor caches. IEEE Trans. Very Large Scale Integr. Syst. 11, 3, 303--313. Google ScholarDigital Library
Hardavellas, N., Ferdman, M., Falsafi, B., and Ailamaki, A. 2009. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA). ACM, New York, NY, 184--195. Google ScholarDigital Library
Harper, J. S., Kerbyson, D. J., and Nudd, G. R. 1999. Analytical modeling of set-associative cache behavior. IEEE Trans. Comput. 48, 10, 1009--1024. Google ScholarDigital Library
Heidelberger, P. and Stone, H. S. 1990. Parallel trace-driven cache simulation by time partitioning. In Proceedings of the 22nd Conference on Winter Simulation. IEEE, Piscataway, NJ, 734--737. Google ScholarDigital Library
Hill, M. D. and Smith, A. J. 1989. Evaluating associativity in CPU caches. IEEE Trans. Comput. 38, 12, 1612--1630. Google ScholarDigital Library
Hind, M., Rjan, V., and Sweeney, P. 2003. Phase shift detection: A problem classification. Tech. rep., IBM.Google Scholar
Hsu, L., Reinhardt, S., Iyer, R., and Makineni, S. 2006. Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a shared resource. In Proceedings of the International Conference on Parallel Architectures and Computation Technologies (PACT). ACM, New York, NY, 13--22. Google ScholarDigital Library
Hu, J. S., Nadgir, A., Vijaykrishnan, N., Irwin, M. J., Kandemir, M. 2003. Exploiting program hotspots and code sequentiality for instruction cache leakage management. In Proceedings of the International Symposium on Low Power Electronics and Design (ISPLED). ACM, New York, NY, 402--407. Google ScholarDigital Library
Huang, M., Renau, J., and Torrellas, J. 2003. Positional adaptation of processors: Application to energy reduction. In Proceedings of the 30th Anual International Symposium on Computer Architecture. ACM, New York, NY, 157--168. Google ScholarDigital Library
Huang, W., Ghosh, S., Velusamy, S., Sankaranarayanan, K., Skadron, K., and Stan, M. R. 2006. HotSpot: A compact thermal modeling method for CMOS VLSI Systems. IEEE Trans. Very Large Scale Integ. Syst. 14, 5, 501--513. Google ScholarDigital Library
Huang, C., Sheldon, D., and Vahid, F. 2008. Dynamic tuning of configurable architectures: The AWW online algorithm. In Proceedings of the 6th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. ACM, New York, NY, 97--102. Google ScholarDigital Library
Huh, J., Kim, C., Shafi, H., Zhang, L., Burger, D., and Keckler, S. 2007. A NUCA substrate for flexible CMP cache sharing. IEEE Trans. Parallel Distribu. Syst. 18, 8, 1028--1040. Google ScholarDigital Library
Hughes, C. J., Pai, V. S., Ranganathan, P., and Adve, S. V. 2002. Rsim: Simulating shared-memory multiprocessors with ILP processors. IEEE Computer 35, 2, 40--49. Google ScholarDigital Library
Inoue, K., Moshnyaga, V., and Murakami, K. 2001. Trends in high-performance, low-power cache memory architectures. IEICE Trans. Electronics 85, 314.Google Scholar
Iyer, R. 2003. On modeling and analyzing cache hierarchies using CASPER. In Proceedings of the 11th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, Washington, DC, 182--187.Google ScholarCross Ref
Iyer, R. 2004. CQoS: A framework for enabling QoS in shared caches of CMP platforms. In Proceedings of the 18th Annual International Conference on Supercomputing. ACM, New York, NY, 257--266. Google ScholarDigital Library
Jaleel, A., Cohn, R. S., Luk, C. K., and Jacob. B. 2008a. CMP&dollar;im: A pinbased on-the-fly multi-core cache simulator. In Proceedings of the 4th Annual Workshop on Modeling Benchmarking and Simulation.Google Scholar
Jaleel, A., Hasenplaugh, W., Qureshi, M., Sebot, J., Steely, Jr. S., and Emer, J. 2008b. Adaptive insertion policies for managing shared caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ACM, New York, NY, 208--219. Google ScholarDigital Library
Janapsatya, A., Lgnjatović A., and Parameswaran, S. 2006. Finding optimal L1 cache configuration for embedded systems. In Proceedings of the Asia and South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 796--801. Google ScholarDigital Library
Janapsatya, A., Lgnjatović, A., Parameswaran, S., and Henkel, J. 2007. Instruction trace compression for rapid instruction cache simulation. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, San Jose, CA, 803--808. Google ScholarDigital Library
Joshi, A., Yi, J. J., Bell, R. H., Jr., Eeckhout, L. John, L., and Lilja, D. 2006. Evaluating the efficacy of statistical simulation for design space exploration. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 70--79.Google Scholar
Kaxiras, S., Hu, Z., and Martonosi, M. 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th International Symposium on Computer Architecture. IEEE, Washington, DC, 240--251. Google ScholarDigital Library
Kaxiras, S. and Martonosi, M. 2008. Computer architecture techniques for power-efficiency. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. Google ScholarDigital Library
Kessler, R. E. and Hill, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst. 10, 4, 338--359. Google ScholarDigital Library
Kihm, J. L. and Connors, D. A. 2005. A mathematical model for accurately balancing co-phase effect in simulated multithreaded systems. In Proceedings of the Workshop on Modeling, Benchmarking and Simulation held in conjunction with ISCA-32.Google Scholar
Kim, N. S., Flautner, K., Blaauw, D., and Mudge, T. 2002. Drowsy instruction caches--leakage power reduction using dynamic voltage scaling. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-35). IEEE, Los Alamitos, CA, 219--230. Google ScholarDigital Library
Kim, S., Chandra, D., and Solihin, Y. 2004. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, Washington, DC, 111--122. Google ScholarDigital Library
Kim, C. H., Kim, J., Mukhopadhyay, S., and Roy. K. 2005. A forward body-biased low-leakage SRAM cache: Device, circuit and architecture considerations. IEEE Trans. Very Large Scale Integr. Syst. 13, 3, 349--357. Google ScholarDigital Library
Laha, S., Patel, J. H., and Iyer R. K. 1988. Accurate low-cost methods for performance evaluation of cache memory systems. IEEE Trans. Comput. 37, 11, 1325--1336. Google ScholarDigital Library
Lau, J., Schoenmackers, S., and Calder, B. 2004. Structures for phase classification. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Piscataway, New Jersey, 57--67. Google ScholarDigital Library
Lau, J., Schoenmackers, S., and Calder, B. 2005. Transition phase classification and prediction. In Proceedings of the International Symposium on High-Performance Computer Architecture. IEEE, Washington, DC, 278--289. Google ScholarDigital Library
Lau, J., Perelman, E., and Calder, B. 2006. Selecting software phase markers with code structure analysis. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, Washington, DC, 135--146. Google ScholarDigital Library
Lebeck, A. and Wood, D. 1994. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27, 10, 15--26. Google ScholarDigital Library
Lee, H.-H. S., Tyson, G. S., and Farrens, M. K. 2000, Eager Writeback—A technique for improving bandwidth utilization. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, New York, NY, USA. 11--21. Google ScholarDigital Library
Lee, K. Evans, S., and Cho, S. 2009. Accurately approximating superscalar processor performance from traces. In Proceedings of the International Symposium Performance Analysis of Systems and Software (ISPASS), IEEE, Piscataway, New Jersey, 238--248.Google Scholar
Lee, H., Jin, L., Lee, K., Demetriades, S., Moeng, M., and Cho, S. 2010. Two-phase trace-driven simulation (TPTS): A fast multicore processor architecture simulation approach. J. Soft.-Pract. Expe. 40, 3, John Wiley & Sons, Inc. New York, NY, 239--258. Google ScholarDigital Library
Lee, K. and Cho, S. 2011. In-N-Out: Reproducing out-of-order superscalar processor behavior from reduced in-order traces. In Proceedings of the International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, Washington, DC, 126--135. Google ScholarDigital Library
Lee, H., Cho, S., and Childers, B. 2011. CloudCache: Expanding and shrinking private caches. In Proceedings of Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 219--230. Google ScholarDigital Library
Li, L., Kadayif, I., Tsai, Y. F., Vijaykrishnan, N., Kandemir, M., Irwin, M. J., and Sivasubramaniam, A. 2002. Leakage energy management in cache hierarchies. In Proceedings International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 131--140. Google ScholarDigital Library
Li, Y., Parikh, D., Zhang, Y., Sankaranarayanan, K., Skadron, K., and Stan, M. 2004. State-preserving vs. non-state-perserving leakage control in caches. In Proceedings of the Conference on Design, Automation and Test in Europe, IEEE, Washington, DC, 10. Google ScholarDigital Library
Lin, J., Lu, Q., Ding, X., Zhang, Z., Zhang, X., and Sadayappan, P. 2009. Enabling software management for multicore caches with a lightweight hardware support. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, New York, NY. Google ScholarDigital Library
Liu, C., Sivasubramaniam, A., and Kandemir, M. 2004. Organizing the last line of defense before hitting the memory wall for CMPs. In Proceedings of the Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC. 176--185. Google ScholarDigital Library
Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Languages Design and Implementation (PLDI). ACM, New York, NY. 190--200. Google ScholarDigital Library
Magnusson, P. S., Christensson, M., Eskilson, K. J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., and Werner, B. 2002. Simics: A full system simulation platform. Computer. 35, 2. 50--58. Google ScholarDigital Library
Malik, A., Moyer, B., and Cermak, D. 2000. A low power unified cache architecture providing power and performance flexibility. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, NY, USA. 241--243. Google ScholarDigital Library
Martin, M. M. K., Sorin, D. J., Beckmann, B. M., Marty, M. R., Xu, M., Alameldeen, A. R., Moore, K. E., Hill, M. D., and Wood, D. A. 2005. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. ACM SIGARCH Comput. Architec. News 33, 4. ACM New York, NY, 92--99. Google ScholarDigital Library
Mattson, R. L., Gecsei, J., Slutz, D. R., and Traiger, I. L. 1970. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2, 78--117. Google ScholarDigital Library
Meng, Y., Sherwood, T., and Kastner, R. 2005. Exploring the limits of leakage power reduction in caches. ACM Tran. Architec. Code Optim. 2, 3, 221--246. Google ScholarDigital Library
Mihocka, D. and Schwartsman, S. 2008. Virtualization without direct execution or jitting: Designing a portable virtual machine infrastructure. In Proceedings of the Workshop on Architectural and Microarchitectural Support for Binary Translation, held in conjunction with ISCA.Google Scholar
Miller, J. E., Kasture, H., Kurian, G., Gruenwald, C., Beckmann, N., Celio, C., Eastep, J., and Agarwal, A. 2010. Graphite: A distributed parallel simulator for multicores. In Proceedings of the IEEE 16th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 1--12.Google Scholar
Mips R4000. Microprocessor user's manual, http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_book_Ed2.pdf.1994.Google Scholar
Mips32. 4ktm Processor core family software user's manual, http://d3s.mff.cuni.cz/&sim;ceres/sch/osy/download/MIPS32-4K-Manual.pdf.2001.Google Scholar
Montanaro, J., Witek, R. T., and Anne, K. Et Al. 1997. A 160-MHz, 32-b 0.5-W CMOS RISC microprocessor, Dig. Tech. J. 9, 1, 49--62. Google ScholarDigital Library
Namkung, J., Dohyung K., Gupta, R., Kozintsev, I., Bouget, J.-Y., and Dulong, C. 2006. Phase guided sampling for efficient parallel application simulation. In Proceedings of the International Conference Hardware/Software Codesign and System Synthesis (CODES + ISSS). ACM, New York, NY, 187--192. Google ScholarDigital Library
Ortego, P. M. and Sack, P. 2004. SESC: SuperESCalar Simulator. http://iacoma.cs.uiuc.edu/&sim;paulsack/sescdoc/.Google Scholar
Perelman, E., Polito, M., Bouguet, J.-Y., Sampson, J., Calder, B., and Dulong, C. 2006. Detecting phases in parallel applications on shared memory architectures. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IEEE, Washington, DC, 88--98. Google ScholarDigital Library
Powell, M. D., Yang, S., Falsafi, B., Roy, K., and Vijaykumar, T. N. 2000. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the International Symposium on Low Power Electronics and Design, ACM, New York, NY, 90--95. Google ScholarDigital Library
Powell, M., Yang, S.-H., Falsafi, B., Roy, K., and Vijaykumar, T. N. 2001. Reducing leakage in a high-performance deep-submicron instruction cache. IEEE Trans. VLSI Syst. 9, 1, 77--89. Google ScholarDigital Library
Qureshi, M. and Patt, Y. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the (MICRO). IEEE, Washington, DC, 423--432. Google ScholarDigital Library
Qureshi, M. K., Jaleel, A., Patt, Y. N., Steely, S. C., and Emer, J. 2007. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), ACM, New York, NY, 381--391. Google ScholarDigital Library
Qureshi, M. K. 2009. Adaptive spill-receive for robust high-performance caching in CMPs. In Proceedings of the 15th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 45--54.Google ScholarCross Ref
Rajkumar, R., Lee, C., Lehoczky, J., and Siewiorek, D. 1997. A resource allocation model for QoS management. In Proceedings of the 18th IEEE Real-Time Systems Symposium. IEEE, Washington, DC, 298. Google ScholarDigital Library
Ramaswamy, S. and Yalamanchili, S. 2007. Improving cache efficiency via resizing + remapping. In Proceedings of the 25th International Conference on Computer Design. IEEE, Washington, DC, 47--54.Google Scholar
Ranganathan, P., Adve, S., and Jouppi, N. P. 2000. Reconfigurable caches and their application to media processing. In Proceedings of the 27th Annual International Symposium on Computer Architecture. ACM, New York, NY, 214--224. Google ScholarDigital Library
Rawlins, M. and Gordon-Ross, A. 2011. CPACT -- the conditional parameter adjustment cache tuner for dual-core architectures. In Proceedings of the IEEE International Conference of Computer Design (ICCD). IEEE, Los Alamitos, CA. Google ScholarDigital Library
Rawlins, M. and Gordon-Ross, A. 2012. An application classification guided cache tuning heuristic for multi-core architectures. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, Piscataway, NJ.Google Scholar
Renau, J., Fraguela, B., Tuck, J., Liu, W., Prvulovic, M., Ceze, L., Strauss, K., Sarangi, S., Sack, P., and Montesinos, P. 2005. SESC Simulator. http://sesc.sourceforge.net.Google Scholar
Rico, A., Duran, A., Cabarcas, F., Etsion, Y., Ramirez, A., and Valero, M. 2011. Trace-driven simulation of multithreaded applications. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Piscataway, New Jersey, 87--96. Google ScholarDigital Library
Rosenblum, M., Bugnion, E., Devine, S., and Herrod, S. A. 1997. Using the SimOS machine simulator to study complex computer systems. ACM Trans. Model. Comput. Simul. 7, 1, 78--103.
Sanchez, H., Kuttanna, B., Olson, T., Alexander, M., Gerosa, G., Philip, R., and Alvarez, J. 1997. Thermal management system for high performance PowerPC™ microprocessors. In Proceedings of the 42nd IEEE International Computer Conference. IEEE, Washington, DC, 325--330.
Segars, S. 2001. Low power design techniques for microprocessors. In Proceedings of the International Solid State Circuit Conference.
Shen, X., Zhong, Y., and Ding, C. 2004. Locality phase prediction. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systems. ACM, New York, NY, 165--176.
Shen, X., Zhong, Y., and Ding, C. 2005. Phase-based miss rate prediction across program inputs. In Proceedings of the 17th International Workshop on Languages and Compilers for High Performance Computing. Springer, Berlin, Heidelberg, Germany, 42--55.
Sherwood, T., Perelman, E., and Calder, B. 2001. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 3--14.
Sherwood, T., Sair, S., and Calder, B. 2003. Phase tracking and prediction. In Proceedings of the 30th Annual International Symposium on Computer Architecture. ACM, New York, NY, 336--349.
Sherwood, T., Perelman, E., Hamerly, G., Sair, S., and Calder, B. 2003. Discovering and exploiting program phases. IEEE Micro 23, 6, 84--93.
Shi, X., Su, F., Peir, J., Xia, Y., and Yang, Z. 2009. Modeling and stack simulation of CMP cache capacity and accessibility. IEEE Trans. Parallel Distrib. Syst. 20, 12, 1752--1763.
Shiue, W. and Chakrabarti, C. 2001. Memory design and exploration for low power, embedded systems. The J. VLSI Signal Process. Syst. 29, 3, 167--178.
Srikantaiah, S., Kandemir, M., and Irwin, M. 2008. Adaptive set pinning: Managing shared caches in chip multiprocessors. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, New York, NY, 135--144.
Srikantaiah, S., Kultursay, E., Zhang, T., Kandemir, M., Irwin, M., and Xie, Y. 2011. MorphCache: A reconfigurable adaptive multi-level cache hierarchy for CMPs. In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 231--242.
Srivastava, A. and Eustace, A. 1994. ATOM: A system for building customized program analysis tools. Tech. rep. 94/2, Western Research Lab, Compaq.
Suh, G. E., Rudolph, L., and Devadas, S. 2004. Dynamic partitioning of shared cache memory. J. Supercomput. 28, 1, 7--26.
Sugumar, R. and Abraham, S. 1991. Efficient simulation of multiple cache configurations using binomial trees. Tech. rep. CSE-TR-111-91.
Sugumar, R. A. 1993. Multi-reconfiguration simulation algorithms for the evaluation of computer architecture designs. Ph.D. Thesis, University of Michigan, Ann Arbor, MI.
Tarjan, D., Thoziyoor, S., and Jouppi, N. P. 2006. CACTI 4.0, Hewlett-Packard Laboratories Technical Report # HPL-2006-86.
Thompson, J. G. and Smith, A. J. 1989. Efficient (stack) algorithms for analysis of write-back and sector memories. ACM Trans. Comput. Syst. 7, 1, 78--117.
Ishihara, T. and Fallah, F. 2005. A non-uniform cache architecture for low power system design. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, NY, 363--368.
Uhlig, R. A. and Mudge, T. N. 1997. Trace-driven memory simulation: A survey. ACM Comput. Surv. 29, 2, 128--170.
Varadarajan, K., Nandy, S., Sharda, V., Bharadwaj, A., Iyer, R., Makineni, S., and Newell, D. 2006. Molecular caches: A caching structure for dynamic creation of application-specific heterogeneous cache regions. In Proceedings of the (MICRO), IEEE, Los Alamitos, CA, 433--442.
Veidenbaum, A., Tang, W., Gupta, R., Nicolau, A., and Ji, X. 1999. Adapting cache line size to application behavior. In Proceedings of the International Conference on Supercomputing. ACM, New York, NY, 145--154.
Venkatachalam, V. and Franz, M. 2005. Power reduction techniques for microprocessor systems. ACM Comput. Surv. 37, 3, 195--237.
Vera, X., Bermudo, N., Llosa, J., and Gonzalez, A. 2004. A fast and accurate framework to analyze and optimize cache memory behavior. ACM Trans. Program. Lang. Syst. 26, 2, 263--300.
Viana, P., Gordon-Ross, A., Keogh, E., Barros, E., and Vahid, F. 2006. Configurable cache subsetting for fast cache tuning. In Proceedings of the ACM Design Automation Conference. ACM, New York, NY, 695--900.
Viana, P., Gordon-Ross, A., Barros, E., and Vahid, F. 2008. A table-based method for single-pass cache optimization. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, New York, NY, 71--76.
Vivekanandarajah, K., Srikanthan, T., and Clarke, C. T. 2006. Profile directed instruction cache tuning for embedded systems. In Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures. IEEE, Washington, DC, 227.
Wan, H., Gao, X., Long, X., and Wang, Z. 2009. GCSim: A GPU-based trace-driven simulator for multi-level cache. Advan. Parallel Process. Technol. 177--190.
Wenisch, T. F., Wunderlich, R. E., Ferdman, M., Ailamaki, A., Falsafi, B., and Hoe, J. C. 2006. SimFlex: Statistical sampling of computer system simulation. IEEE Micro 26, 4, 18--31.
Witchell, E. and Rosenblum, M. 1996. Embra: Fast and flexible machine simulation. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, 68--79.
Wunderlich, R. E., Wenisch, T. F., Falsafi, B., and Hoe, J. C. 2003. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the Annual International Symposium on Computer Architecture (ISCA). IEEE, Washington, DC, 84--95.
Xiang, X., Bao, B., Bai, T., Ding, C., and Chilimbi, T. 2011. All-window profiling and composable models of cache sharing. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. ACM, New York, NY, 91--102.
Xie, Y. and Loh, G. 2009. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. ACM SIGARCH Comput. Architec. News 37, 3, 174--183.
Xu, C., Chen, X., Dick, R. P., and Mao, Z. M. 2010. Cache contention and application performance prediction for multi-core systems. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS). IEEE, Piscataway, NJ, 76--86.
Yeh, T. and Reinman, G. 2005. Fast and fair: Data-stream quality of service. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). ACM, New York, NY, 237--248.
Yourst, M. T. 2007. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, Piscataway, NJ, 23--34.
Zang, W. and Gordon-Ross, A. 2011. T-SPaCS - a two-level single-pass cache simulation methodology. In Proceedings of the 16th Asia and South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 419--424.
Zhang, W., Hu, J. S., Degalahal, V., Kandemir, M., Vijaykrishnan, N., and Irwin, M. J. 2002. Compiler-directed instruction cache leakage optimization. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35). IEEE, Los Alamitos, CA, 208--218.
Zhang, Y., Parikh, D., Sankaranarayanan, K., Skadron, K., and Stan, M. 2003. HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects. Tech. rep. CS-2003-05, Department of Computer Science, University of Virginia, Charlottesville, VA.
Zhang, C., Vahid, F., and Lysecky, R. 2004. A self-tuning cache architecture for embedded systems. Special issue on Dynamically Adaptable Embedded System. ACM Trans. Embed. Comput. Syst. 3, 2, 1--19.
Zhou, H., Toburen, M. C., Rotenberg, E., and Conte, T. 2003. Adaptive mode control: A static-power-efficient cache design. ACM Trans. Embed. Comput. Syst. 2, 3, 347--372.
Zhong, Y., Dropsho, S., and Ding, C. 2003. Miss rate prediction across all program inputs. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 91--101.

Index Terms

A survey on cache tuning from a power/energy perspective
1. General and reference
  1. Document types
    1. Surveys and overviews
2. Hardware
  1. Hardware validation
  2. Integrated circuits
    1. Semiconductor memory
      1. Dynamic memory

Published in

ACM Computing Surveys Volume 45, Issue 3
June 2013
575 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2480741

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 3 July 2013
- Accepted: 1 February 2012
- Revised: 1 November 2011
- Received: 1 May 2011
Author Tags
Cache tuning
cache configuration
cache partitioning
energy saving
power saving
Qualifiers
- research-article
- Research
- Refereed