Abstract
Low power and/or energy consumption is a requirement not only in embedded systems that run on batteries or have limited cooling capabilities, but also in desktops and mainframes, where chips require costly cooling techniques. Since the cache subsystem is typically the most power/energy-consuming subsystem, caches are good candidates for power/energy optimizations, and cache tuning techniques are therefore widely researched. This survey focuses on state-of-the-art offline static and online dynamic cache tuning techniques and summarizes the techniques' attributes, major challenges, and potential research trends to inspire novel ideas and future research avenues.
- Albonesi, D. H. 1999. Selective cache way: On-demand cache resource allocation. In Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture. IEEE, Washington, DC, 248--259. Google ScholarDigital Library
- Ammons, G., Ball, T., and Larus, J. R. 1997. Exploiting hardware performance counters with flow and context sensitive profiling, In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, NY, 85--96. Google ScholarDigital Library
- Anderson, J. M., Berc, L. M., Dean, J., Ghemawat, S., Henzinger M. R., Leung S. A., Sites, R. L., Vandevoorde, M. T., Waldspurger C. A., and Weihl W. E. 1997. Continuous profiling: Where have all the cycles gone? ACM Trans. Comput. Syst. 15, 4, 357--390. Google ScholarDigital Library
- Austin, T., Larson, E., and Ernst, D. 2002. SimpleScalar: An infrastructure for comput. system modeling. IEEE Comput. 35, 2, 59--67. Google ScholarDigital Library
- Awasthi, M., Sudan, K., Balasubramonian, R., and Carter, J. 2009. Dynamic hardware-assisted software-controlled page placement to manage capacity allocation and sharing within large caches. In Proceedings of Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 250--261.Google Scholar
- Balasubramonian, R., Albonesi, D., Buyuktosunoglu, A. and Dwarkadas, S. 2000. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture, ACM, New York, NY, 245--257. Google ScholarDigital Library
- Balasubramonian, R., Jouppi, N. P., and Muralimanohar, N. 2011. Multi-core cache hierarchies. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. Google ScholarDigital Library
- Beckmann, B., Marty, M., and Wood, D. 2006. ASR: Adaptive selective replication for CMP caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, Los Alamitos, CA, 443--454. Google ScholarDigital Library
- Bedichek, R. 2004. SimNow: Fast platform simulation purely in software. In Proceedings of the Symposium on High Performance Chips (HOT CHIPS).Google Scholar
- Bellard, F. 2005. QEMU, a fast and portable dynamic translator, USENIX' 05 Technical Program. Google ScholarDigital Library
- Benitez, D., Moure, J. C., Rexachs, D. I., and Luque E. 2006. Evaluation of the field-prorammable cache: Performance and energy consumption, In Proceedings of the 3rd Conference on Computing Frontiers, ACM, New York, NY, 361--372. Google ScholarDigital Library
- Biesbrouck, M. V., Sherwood, T., and Calder. B. 2004. A co-phase matrix to guide simultaneous multithreading simulation. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Washington, DC, 45--56. Google ScholarDigital Library
- Binkert, N. L., Dreslinski, R. G., Hsu, L. R., Lim, K. T., Saidi, A. G., and Reinhardt, S. K. 2006. The M5 simulator: Modeling networked systems. IEEE Micro. 26, 4, 52-60. Google ScholarDigital Library
- Bohr, M. T., Chau, R. S., Ghani, T., and Mistry, K. 2007. The high-k solution, IEEE Spectrum. Google ScholarDigital Library
- Brehob, M. and Enbody, R. J. 1996. An analytical model of locality and caching. Tech. rep. Michigan State University, East Lansing, MI.Google Scholar
- Brooks, D. M., Tiwari, V., and Martonosi, M. 2000. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of 27th International Symposium on Computer Architecture. IEEE, Washington, DC, 83--94. Google ScholarDigital Library
- Brooks, D. M., Bose, P., Srinivasan, V., Gschwind, M., Emma, P., and Rosenfield, M. 2003. New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors. IBM J. Res. Develop. 47, 5--6, 653--670. Google ScholarDigital Library
- Chandra, D., Guo, F., Kim, S., and Solihin, Y. 2005. Predicting inter-thread cache contention on a chip multi-processor architecture. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. IEEE, Washington, DC, 340--351. Google ScholarDigital Library
- Chang, J. and Sohi, G. 2006. Co-operative caching for chip multiprocessors. In Proceedings of the 33rd Annual International Symposium on Computer Architecture (ISCA). IEEE, Washington, DC, 264--276. Google ScholarDigital Library
- Chatterjee, S., Parker, E., Hanlon, P. J., and Lebeck, A. R. 2001. Exact analysis of the cache behavior of nested loops. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM, New York, NY, 286--297. Google ScholarDigital Library
- Chatterjee, B., Sachdev, M., Hsu, S., Krishnamurthy, R., and Borkar, S. 2003. Effectiveness and scaling trends of leakage control techniques for sub-130 nm CMOS technologies. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED). IEEE, Washington, DC, 122--127. Google ScholarDigital Library
- Chen, C. F., Yang, S., Falsafi, B., and Moshovos, A. 2004. Accurate and complexity-effective spatial pattern prediction. In Proceedings of the 10th International Symposium on High Performance Computer Architecture. IEEE, Washington, DC, 276. Google ScholarDigital Library
- Chen, J., Dubois, M., and Stenstrom, P. 2007. SimWattch: Integrating complete-system and user-level performance and power simulators, IEEE Micro, 27, 4, 34--48. Google ScholarDigital Library
- Chen, X. E. and Aamodt, T. M. 2009. A first-order fine-grained multithreaded throughput model. In Proceedings of the IEEE 15th International Symposium on High Performance Computer Architecture. IEEE, Washington, DC, 329--340.Google Scholar
- Chen, J., Annavaram, M., and Dubois, M. 2009. SlackSim: A platform for parallel simulation of CMPs on CMPs. ACM SIGARCH Comput. Architect. News. 37, 2, 20--29. Google ScholarDigital Library
- Chidester, M. C. and George, A. D. 2002. Parallel simulation of chip-multiprocessor architectures. ACM Trans. Model. Comput. Simul. (TOMACS) 12, 3, 176--200. Google ScholarDigital Library
- Chiou, D., Chiouy, D., Rudolph, L., Devadas, S., and Ang, B. S. 2000. Dynamic cache partitioning via columnization. Computation Structures Group Memo 430. Massachusetts Institute of Technology.Google Scholar
- Cho, S. and Jin, L. 2006. Managing distributed, shared L2 caches through OS-Level page allocation. In Proceedings of the ACM/IEEE International Symposium on Microarchitectures (MICRO). IEEE, Washington, DC, 455--468 Google ScholarDigital Library
- Cmelik, B. and Keppel, D. 1994. SHADE: A fast instruction-set simulator for execution profiling. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, 128--137. Google ScholarDigital Library
- Conte, T. M., Hirsch, M. A., and Hwu, W. W. 1998. Combining trace sampling with single pass methods for efficient cache simulation. IEEE Trans. Comput. 47, 6, 714--720. Google ScholarDigital Library
- Dean, J., Hicks, J. E., Waldspurger, C. A., Weihl, W. E., and Chrysos, G. 1997. ProfileMe: Hardware support for instruction-level profiling in out-of-order processors. In Proceedings of the 30th Anual ACM/IEEE International Symposium on Microarchitecture. IEEE, Washington, DC, 292--302. Google ScholarDigital Library
- Dhodapkar, A. S. and Smith, J. E. 2002. Managing multi-configuration hardware via dynamic working set analysis. In Proceedings of the 29th Annual International Symposium on Computer Architecture. IEEE, Washington, DC, 233--244. Google ScholarDigital Library
- Dhodapkar, A. S. and Smith, J. E. 2003. Comparing program phase detection techniques. In Proceedings of the International Symposium on Microarchitecture. IEEE, Washington, DC, 217. Google ScholarDigital Library
- Díaz, J., Hidalgo, J. I., Fernández, F., Garnica, O., and López, S. 2009. Improving SMT performance: An application of genetic algorithms to configure resizable caches. In Proceedings of the 11th Annual Conference Companion on Genetic and Evolutionary Computation Conference. ACM, New York, NY, 2029--2034. Google ScholarDigital Library
- Ding, C. and Zhong, Y. 2003. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. ACM, New York, NY, 245--257. Google ScholarDigital Library
- Ding, C. and Chilimbi, T. 2009. A composable model for analyzing locality of multi-threaded programs. Tech. rep. MSR-TR-2009-107, Microsoft.Google Scholar
- Dropsho, S., Buyuktosunoglu, A., Balasubramonian, R., Albonesi, D. H., Dwarkadas, S., Semeraro, G., Magklis, G., and Scott, M. L. 2002. Integrating adaptive on-chip storage structures for reduced dynamic power. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 141--152. Google ScholarDigital Library
- Dropsho, S., Kursun, V., Albonesi, D. H., Dwarkadas, S., and Friedman, E. G. 2002. Managing static leakage energy in microprocessor functional units. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'35). IEEE, Los Alamitos, CA, 321--332. Google ScholarDigital Library
- Duesterwald, E., Cascaval, C., and Dwarkadas, S. 2003. Characterizing and predicting program behavior and its variability. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 220--231. Google ScholarDigital Library
- Dybdahl H. and Stenstrom, P. 2007. An adaptive shared/private nuca cache partitioning scheme for chip multiprocessors. In Proceedings of the Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 2--12. Google ScholarDigital Library
- Edler, J. and Hill, M. D. 1998. Dinero IV trace-driven uniprocessor cache simulator. http://www.cs.wisc.edu/∼markhill/DineroIV.Google Scholar
- Edmondon, J., Rubinfeld, P. I., Bannon P. J., Benschneider, B. J., Bernstein, D., Castelino, R. W., Cooper, E. M., Dever, D. E., Donchin, D. R., Fischer, T. C., Jain, A. K., Mehta, S., Meyer, J. E., Preston, R. P., Rajagopalan, V., Somanathan, C., Taylor, S. A., and Wolrich, G. M. 1995. Internal organization of the Alpha 21164, a 300-MHz 64-bit Quad-issue CMOS RISC microprocessor. Digi. Tech. J. Special 10th Anniversary Issue, 7, 1, 119--135. Google ScholarDigital Library
- Eeckhout, L., Nussbaum, S., Smith, J. E., and Bosschere, K. D. 2003. Statistical simulation: Adding efficiency to the computer designer's toolbox. IEEE Micro. 23, 5, 26--38. Google ScholarDigital Library
- Eeckhout, L. 2010. Computer architecture performance evaluation methods. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. Google ScholarDigital Library
- Eklov, D., Black-Schaffer, D., and Hagersten, E. 2011. Fast modeling of shared cache in multicore systems. In Proceedings of the 6th International Conference on High Performance and Embedded Architectures and Compilers. ACM New York, NY, 147--157. Google ScholarDigital Library
- Falcón, A., Faraboschi, P., and Ortega. D. 2008. An adaptive synchronization technique for parallel simulation of networked clusters. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 22--31. Google ScholarDigital Library
- Fang, C., Carr, S., Onder, S., and Wang, Z. 2004. Reuse-distance-based miss-rate prediction on a per instruction basis. In Proceedings of the Workshop on Memory System Performance. ACM, New York, NY, 60--68. Google ScholarDigital Library
- Flautner, K., Kim, N. S., Matin, S., Blaauw, D., and Mudge, T. 2002. Drowsy caches: Simple techniques for reducing leakage power, In Proceedings of the 29th Annual International Symposium on Computer Architecture. ACM, New York, NY, 148--157. Google ScholarDigital Library
- Genbrugge, D., Eeckhout, L., and Bosschere K. D. 2006. Accurate memory data flow modeling in statistical simulation. In Proceedings of the 20th Annual International Conference of Supercomputing. ACM, New York, NY. Google ScholarDigital Library
- Ghosh, S., Martonosi, M., and Malik, S. 1999. Cache miss equations: A compiler framework for analyzing and tuning memory behavior. ACM Trans. Program. Lang. Syst. 21, 4, 703--746. Google ScholarDigital Library
- Ghosh, A., and Givargis, T. 2004. Cache optimization for embedded processor cores: An analytical approach. ACM Trans. Design Autom. Electron. Syst. 9, 4, 419--440. Google ScholarDigital Library
- Gluhovsky, I. and O'Krafka, B. 2005. Comprehensive multiprocessor cache miss rate generation using multivariate models. ACM Trans. Comput. Syst. 23, 2. 111--145. Google ScholarDigital Library
- Goldschmidt, S. and Hennessey, J. 1992. The accuracy of trace-driven simulations of multiprocessors. Tech rep. CSL-TR-92-546, Stanford University. Google ScholarDigital Library
- Gordon-Ross, A., Vahid, F., and Dutt, N. 2004. Automatic tuning of two level caches to embedded applications. In Proceedings of the Conference on Design, Automation and Test in Europe. IEEE, Washington, DC. Google ScholarDigital Library
- Gordon-Ross, A. and Vahid, F. 2007. A self-tuning configurable cache. In Proceedings of the 44th Anual Design Automation Conference. ACM, New York, NY, 234--237. Google ScholarDigital Library
- Gordon-Ross, A., Viana, P., Vahid, F., Najjar, W., and Barros, E. 2007. A one-shot configurable-cache tuner for improved energy and performance. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, San Jose, CA, 755--760. Google ScholarDigital Library
- Gordon-Ross, A., Lau, J., and Calder, B. 2008. Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, New York, NY, 379--382. Google ScholarDigital Library
- Gordon-Ross, A., Vahid, F., and Dutt, N. 2009. Fast configurable-cache tuning with a unified second-level cache. IEEE Trans. Very Large Scale Integ. (VLSI) Syst. 17, 1, 80--91. Google ScholarDigital Library
- Hamerly. G., Perelman, E., Lau, J., and Calder, B. 2005. SimPoint 3.0: Faster and more flexible program analysis. J. Instruct.-Level Parall. 7, 1--28.Google Scholar
- Hanson, H., Hrishikesh, M. S., Agarwal, V., Keckler, S. W., and Burger, D. 2003. Static energy reduction techniques for microprocessor caches. IEEE Trans. Very Large Scale Integr. Syst. 11, 3, 303--313. Google ScholarDigital Library
- Hardavellas, N., Ferdman, M., Falsafi, B., and Ailamaki, A. 2009. Reactive NUCA: Near-optimal block placement and replication in distributed caches. In Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA). ACM, New York, NY, 184--195. Google ScholarDigital Library
- Harper, J. S., Kerbyson, D. J., and Nudd, G. R. 1999. Analytical modeling of set-associative cache behavior. IEEE Trans. Comput. 48, 10, 1009--1024. Google ScholarDigital Library
- Heidelberger, P. and Stone, H. S. 1990. Parallel trace-driven cache simulation by time partitioning. In Proceedings of the 22nd Conference on Winter Simulation. IEEE, Piscataway, NJ, 734--737. Google ScholarDigital Library
- Hill, M. D. and Smith, A. J. 1989. Evaluating associativity in CPU caches. IEEE Trans. Comput. 38, 12, 1612--1630. Google ScholarDigital Library
- Hind, M., Rjan, V., and Sweeney, P. 2003. Phase shift detection: A problem classification. Tech. rep., IBM.Google Scholar
- Hsu, L., Reinhardt, S., Iyer, R., and Makineni, S. 2006. Communist, utilitarian, and capitalist cache policies on CMPs: Caches as a shared resource. In Proceedings of the International Conference on Parallel Architectures and Computation Technologies (PACT). ACM, New York, NY, 13--22. Google ScholarDigital Library
- Hu, J. S., Nadgir, A., Vijaykrishnan, N., Irwin, M. J., Kandemir, M. 2003. Exploiting program hotspots and code sequentiality for instruction cache leakage management. In Proceedings of the International Symposium on Low Power Electronics and Design (ISPLED). ACM, New York, NY, 402--407. Google ScholarDigital Library
- Huang, M., Renau, J., and Torrellas, J. 2003. Positional adaptation of processors: Application to energy reduction. In Proceedings of the 30th Anual International Symposium on Computer Architecture. ACM, New York, NY, 157--168. Google ScholarDigital Library
- Huang, W., Ghosh, S., Velusamy, S., Sankaranarayanan, K., Skadron, K., and Stan, M. R. 2006. HotSpot: A compact thermal modeling method for CMOS VLSI Systems. IEEE Trans. Very Large Scale Integ. Syst. 14, 5, 501--513. Google ScholarDigital Library
- Huang, C., Sheldon, D., and Vahid, F. 2008. Dynamic tuning of configurable architectures: The AWW online algorithm. In Proceedings of the 6th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis. ACM, New York, NY, 97--102. Google ScholarDigital Library
- Huh, J., Kim, C., Shafi, H., Zhang, L., Burger, D., and Keckler, S. 2007. A NUCA substrate for flexible CMP cache sharing. IEEE Trans. Parallel Distribu. Syst. 18, 8, 1028--1040. Google ScholarDigital Library
- Hughes, C. J., Pai, V. S., Ranganathan, P., and Adve, S. V. 2002. Rsim: Simulating shared-memory multiprocessors with ILP processors. IEEE Computer 35, 2, 40--49. Google ScholarDigital Library
- Inoue, K., Moshnyaga, V., and Murakami, K. 2001. Trends in high-performance, low-power cache memory architectures. IEICE Trans. Electronics 85, 314.Google Scholar
- Iyer, R. 2003. On modeling and analyzing cache hierarchies using CASPER. In Proceedings of the 11th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, Washington, DC, 182--187.Google ScholarCross Ref
- Iyer, R. 2004. CQoS: A framework for enabling QoS in shared caches of CMP platforms. In Proceedings of the 18th Annual International Conference on Supercomputing. ACM, New York, NY, 257--266. Google ScholarDigital Library
- Jaleel, A., Cohn, R. S., Luk, C. K., and Jacob. B. 2008a. CMP$im: A pinbased on-the-fly multi-core cache simulator. In Proceedings of the 4th Annual Workshop on Modeling Benchmarking and Simulation.Google Scholar
- Jaleel, A., Hasenplaugh, W., Qureshi, M., Sebot, J., Steely, Jr. S., and Emer, J. 2008b. Adaptive insertion policies for managing shared caches. In Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, ACM, New York, NY, 208--219. Google ScholarDigital Library
- Janapsatya, A., Lgnjatović A., and Parameswaran, S. 2006. Finding optimal L1 cache configuration for embedded systems. In Proceedings of the Asia and South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 796--801. Google ScholarDigital Library
- Janapsatya, A., Lgnjatović, A., Parameswaran, S., and Henkel, J. 2007. Instruction trace compression for rapid instruction cache simulation. In Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, San Jose, CA, 803--808. Google ScholarDigital Library
- Joshi, A., Yi, J. J., Bell, R. H., Jr., Eeckhout, L. John, L., and Lilja, D. 2006. Evaluating the efficacy of statistical simulation for design space exploration. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. 70--79.Google Scholar
- Kaxiras, S., Hu, Z., and Martonosi, M. 2001. Cache decay: Exploiting generational behavior to reduce cache leakage power. In Proceedings of the 28th International Symposium on Computer Architecture. IEEE, Washington, DC, 240--251. Google ScholarDigital Library
- Kaxiras, S. and Martonosi, M. 2008. Computer architecture techniques for power-efficiency. Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA. Google ScholarDigital Library
- Kessler, R. E. and Hill, M. D. 1992. Page placement algorithms for large real-indexed caches. ACM Trans. Comput. Syst. 10, 4, 338--359. Google ScholarDigital Library
- Kihm, J. L. and Connors, D. A. 2005. A mathematical model for accurately balancing co-phase effect in simulated multithreaded systems. In Proceedings of the Workshop on Modeling, Benchmarking and Simulation held in conjunction with ISCA-32.Google Scholar
- Kim, N. S., Flautner, K., Blaauw, D., and Mudge, T. 2002. Drowsy instruction caches--leakage power reduction using dynamic voltage scaling. In Proceedings of the 33rd Annual ACM/IEEE International Symposium on Microarchitecture (MICRO-35). IEEE, Los Alamitos, CA, 219--230. Google ScholarDigital Library
- Kim, S., Chandra, D., and Solihin, Y. 2004. Fair cache sharing and partitioning in a chip multiprocessor architecture. In Proceedings of the 13th International Conference on Parallel Architectures and Compilation Techniques (PACT). IEEE, Washington, DC, 111--122. Google ScholarDigital Library
- Kim, C. H., Kim, J., Mukhopadhyay, S., and Roy. K. 2005. A forward body-biased low-leakage SRAM cache: Device, circuit and architecture considerations. IEEE Trans. Very Large Scale Integr. Syst. 13, 3, 349--357. Google ScholarDigital Library
- Laha, S., Patel, J. H., and Iyer R. K. 1988. Accurate low-cost methods for performance evaluation of cache memory systems. IEEE Trans. Comput. 37, 11, 1325--1336. Google ScholarDigital Library
- Lau, J., Schoenmackers, S., and Calder, B. 2004. Structures for phase classification. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Piscataway, New Jersey, 57--67. Google ScholarDigital Library
- Lau, J., Schoenmackers, S., and Calder, B. 2005. Transition phase classification and prediction. In Proceedings of the International Symposium on High-Performance Computer Architecture. IEEE, Washington, DC, 278--289. Google ScholarDigital Library
- Lau, J., Perelman, E., and Calder, B. 2006. Selecting software phase markers with code structure analysis. In Proceedings of the International Symposium on Code Generation and Optimization. IEEE, Washington, DC, 135--146. Google ScholarDigital Library
- Lebeck, A. and Wood, D. 1994. Cache profiling and the SPEC benchmarks: A case study. IEEE Computer, 27, 10, 15--26. Google ScholarDigital Library
- Lee, H.-H. S., Tyson, G. S., and Farrens, M. K. 2000, Eager Writeback—A technique for improving bandwidth utilization. In Proceedings of the 33rd Annual IEEE/ACM International Symposium on Microarchitecture. ACM, New York, NY, USA. 11--21. Google ScholarDigital Library
- Lee, K. Evans, S., and Cho, S. 2009. Accurately approximating superscalar processor performance from traces. In Proceedings of the International Symposium Performance Analysis of Systems and Software (ISPASS), IEEE, Piscataway, New Jersey, 238--248.Google Scholar
- Lee, H., Jin, L., Lee, K., Demetriades, S., Moeng, M., and Cho, S. 2010. Two-phase trace-driven simulation (TPTS): A fast multicore processor architecture simulation approach. J. Soft.-Pract. Expe. 40, 3, John Wiley & Sons, Inc. New York, NY, 239--258. Google ScholarDigital Library
- Lee, K. and Cho, S. 2011. In-N-Out: Reproducing out-of-order superscalar processor behavior from reduced in-order traces. In Proceedings of the International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS). IEEE, Washington, DC, 126--135. Google ScholarDigital Library
- Lee, H., Cho, S., and Childers, B. 2011. CloudCache: Expanding and shrinking private caches. In Proceedings of Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 219--230. Google ScholarDigital Library
- Li, L., Kadayif, I., Tsai, Y. F., Vijaykrishnan, N., Kandemir, M., Irwin, M. J., and Sivasubramaniam, A. 2002. Leakage energy management in cache hierarchies. In Proceedings International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 131--140. Google ScholarDigital Library
- Li, Y., Parikh, D., Zhang, Y., Sankaranarayanan, K., Skadron, K., and Stan, M. 2004. State-preserving vs. non-state-perserving leakage control in caches. In Proceedings of the Conference on Design, Automation and Test in Europe, IEEE, Washington, DC, 10. Google ScholarDigital Library
- Lin, J., Lu, Q., Ding, X., Zhang, Z., Zhang, X., and Sadayappan, P. 2009. Enabling software management for multicore caches with a lightweight hardware support. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. ACM, New York, NY. Google ScholarDigital Library
- Liu, C., Sivasubramaniam, A., and Kandemir, M. 2004. Organizing the last line of defense before hitting the memory wall for CMPs. In Proceedings of the Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC. 176--185. Google ScholarDigital Library
- Luk, C.-K., Cohn, R., Muth, R., Patil, H., Klauser, A., Lowney, G., Wallace, S., Reddi, V. J., and Hazelwood, K. 2005. Pin: Building customized program analysis tools with dynamic instrumentation. In Proceedings of the ACM SIGPLAN Conference on Programming Languages Design and Implementation (PLDI). ACM, New York, NY. 190--200. Google ScholarDigital Library
- Magnusson, P. S., Christensson, M., Eskilson, K. J., Forsgren, D., Hallberg, G., Hogberg, J., Larsson, F., Moestedt, A., and Werner, B. 2002. Simics: A full system simulation platform. Computer. 35, 2. 50--58. Google ScholarDigital Library
- Malik, A., Moyer, B., and Cermak, D. 2000. A low power unified cache architecture providing power and performance flexibility. In Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, NY, USA. 241--243. Google ScholarDigital Library
- Martin, M. M. K., Sorin, D. J., Beckmann, B. M., Marty, M. R., Xu, M., Alameldeen, A. R., Moore, K. E., Hill, M. D., and Wood, D. A. 2005. Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset. ACM SIGARCH Comput. Architec. News 33, 4. ACM New York, NY, 92--99. Google ScholarDigital Library
- Mattson, R. L., Gecsei, J., Slutz, D. R., and Traiger, I. L. 1970. Evaluation techniques for storage hierarchies. IBM Syst. J. 9, 2, 78--117. Google ScholarDigital Library
- Meng, Y., Sherwood, T., and Kastner, R. 2005. Exploring the limits of leakage power reduction in caches. ACM Tran. Architec. Code Optim. 2, 3, 221--246. Google ScholarDigital Library
- Mihocka, D. and Schwartsman, S. 2008. Virtualization without direct execution or jitting: Designing a portable virtual machine infrastructure. In Proceedings of the Workshop on Architectural and Microarchitectural Support for Binary Translation, held in conjunction with ISCA.Google Scholar
- Miller, J. E., Kasture, H., Kurian, G., Gruenwald, C., Beckmann, N., Celio, C., Eastep, J., and Agarwal, A. 2010. Graphite: A distributed parallel simulator for multicores. In Proceedings of the IEEE 16th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 1--12.Google Scholar
- Mips R4000. Microprocessor user's manual, http://groups.csail.mit.edu/cag/raw/documents/R4400_Uman_book_Ed2.pdf.1994.Google Scholar
- Mips32. 4ktm Processor core family software user's manual, http://d3s.mff.cuni.cz/∼ceres/sch/osy/download/MIPS32-4K-Manual.pdf.2001.Google Scholar
- Montanaro, J., Witek, R. T., and Anne, K. Et Al. 1997. A 160-MHz, 32-b 0.5-W CMOS RISC microprocessor, Dig. Tech. J. 9, 1, 49--62. Google ScholarDigital Library
- Namkung, J., Dohyung K., Gupta, R., Kozintsev, I., Bouget, J.-Y., and Dulong, C. 2006. Phase guided sampling for efficient parallel application simulation. In Proceedings of the International Conference Hardware/Software Codesign and System Synthesis (CODES + ISSS). ACM, New York, NY, 187--192. Google ScholarDigital Library
- Ortego, P. M. and Sack, P. 2004. SESC: SuperESCalar Simulator. http://iacoma.cs.uiuc.edu/∼paulsack/sescdoc/.Google Scholar
- Perelman, E., Polito, M., Bouguet, J.-Y., Sampson, J., Calder, B., and Dulong, C. 2006. Detecting phases in parallel applications on shared memory architectures. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium, IEEE, Washington, DC, 88--98. Google ScholarDigital Library
- Powell, M. D., Yang, S., Falsafi, B., Roy, K., and Vijaykumar, T. N. 2000. Gated-Vdd: A circuit technique to reduce leakage in deep-submicron cache memories. In Proceedings of the International Symposium on Low Power Electronics and Design, ACM, New York, NY, 90--95. Google ScholarDigital Library
- Powell, M., Yang, S.-H., Falsafi, B., Roy, K., and Vijaykumar, T. N. 2001. Reducing leakage in a high-performance deep-submicron instruction cache. IEEE Trans. VLSI Syst. 9, 1, 77--89. Google ScholarDigital Library
- Qureshi, M. and Patt, Y. 2006. Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches. In Proceedings of the (MICRO). IEEE, Washington, DC, 423--432. Google ScholarDigital Library
- Qureshi, M. K., Jaleel, A., Patt, Y. N., Steely, S. C., and Emer, J. 2007. Adaptive insertion policies for high performance caching. In Proceedings of the 34th Annual International Symposium on Computer Architecture (ISCA), ACM, New York, NY, 381--391. Google ScholarDigital Library
- Qureshi, M. K. 2009. Adaptive spill-receive for robust high-performance caching in CMPs. In Proceedings of the 15th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 45--54.Google ScholarCross Ref
- Rajkumar, R., Lee, C., Lehoczky, J., and Siewiorek, D. 1997. A resource allocation model for QoS management. In Proceedings of the 18th IEEE Real-Time Systems Symposium. IEEE, Washington, DC, 298. Google ScholarDigital Library
- Ramaswamy, S. and Yalamanchili, S. 2007. Improving cache efficiency via resizing + remapping. In Proceedings of the 25th International Conference on Computer Design. IEEE, Washington, DC, 47--54.Google Scholar
- Ranganathan, P., Adve, S., and Jouppi, N. P. 2000. Reconfigurable caches and their application to media processing. In Proceedings of the 27th Annual International Symposium on Computer Architecture. ACM, New York, NY, 214--224. Google ScholarDigital Library
- Rawlins, M. and Gordon-Ross, A. 2011. CPACT -- the conditional parameter adjustment cache tuner for dual-core architectures. In Proceedings of the IEEE International Conference of Computer Design (ICCD). IEEE, Los Alamitos, CA. Google ScholarDigital Library
- Rawlins, M. and Gordon-Ross, A. 2012. An application classification guided cache tuning heuristic for multi-core architectures. In Proceedings of the Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, Piscataway, NJ.Google Scholar
- Renau, J., Fraguela, B., Tuck, J., Liu, W., Prvulovic, M., Ceze, L., Strauss, K., Sarangi, S., Sack, P., and Montesinos, P. 2005. SESC Simulator. http://sesc.sourceforge.net.Google Scholar
- Rico, A., Duran, A., Cabarcas, F., Etsion, Y., Ramirez, A., and Valero, M. 2011. Trace-driven simulation of multithreaded applications. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, Piscataway, New Jersey, 87--96. Google ScholarDigital Library
- Rosenblum, M., Bugnion, E., Devine, S., and Herrod S.A. 1997. Using the SimOS machine simulator to study complex computer systems. ACM Trans. Model. Comput. Simul. 7, 1.78--103. Google ScholarDigital Library
- Sanchez, H., Kuttanna, B., Olson, T., Alexander, M., Gerosa, G., Philip, R., and Alvarez, J. 1997. Thermal management system for high performance PowerPC#8482; microprocessors. In Proceedings of the 42nd IEEE International Computer Conference. IEEE, Washington, DC, 325--330. Google ScholarDigital Library
- Segars, S. 2001. Low power design techniques for microprocessors. In Proceedings of the International Solid State Circuit Conference.Google Scholar
- Shen, X., Zhong, Y., and Ding, C. 2004. Locality phase prediction. In Proceedings of the 11th International Conference on Architectural Support for Programming Languages and Operating Systens. ACM, New York, NY, 165--176. Google ScholarDigital Library
- Shen, X., Zhong, Y., and Ding, C. 2005. Phase-based miss rate prediction across program inputs. In Proceedings of the 17th International Workshop on Languages and Compilers for High Performance Computing, Springer, Berlin, Heidelberg, Germany, 42--55. Google ScholarDigital Library
- Sherwood, T., Perelman, E., and Calder, B. 2001. Basic block distribution analysis to find periodic behavior and simulation points in applications. In Proceedings of the International Conference on Parallel Architectures and Compilation Techniques., IEEE, Washington, DC, 3--14. Google ScholarDigital Library
- Sherwood, T., Sair, S., and Calder, B. 2003. Phase tracking and prediction. In Proceedings of the 30th Annual International Symposium on Computer Architecture. ACM, New York, NY, 336--349. Google ScholarDigital Library
- Sherwood, T., Perelman, E., Hamerly, G., Sair, S., and Calder, B. 2003. Discovering and exploiting program phases. IEEE Micro, IEEE, Los Alamitos, CA, 23, 6, 84--93. Google ScholarDigital Library
- Shi, X., Su, F., Peir, J., Xia, Y., and Yang, Z. 2009. Modeling and stack simulation of CMP cache capacity and accessibility. IEEE Trans. Parallel Distrib. Syst. 20, 12, 1752--1763. Google ScholarDigital Library
- Shiue, W. and Chakrabarti, C. 2001. Memory design and exploration for low power, embedded systems. The J. VLSI Signal Process. Syst. 29, 3, 167--178. Google ScholarDigital Library
- Srikantaiah, S., Kandemir, M., and Irwin, M. 2008. Adaptive set pinning: Managing shared caches in chip multiprocessors. In Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). ACM, New York, NY, 135--144. Google ScholarDigital Library
- Srikantaiah, S., Kultursay, E., Zhang, T., Kandemir, M., Irwin, M., and Xie, Y. 2011. MorphCache: A reconfigurable adaptive multi-level cache hierarchy for CMPs. In Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA). IEEE, Washington, DC, 231--242. Google ScholarDigital Library
- Srivastava, A. and Eustace, A. 1994. ATOM: A system for building customized program analysis tools. Tech. rep. 94/2, Western Research Lab, Compaq.Google ScholarDigital Library
- Suh, G. E., Rudolph, L., and Devadas, S. 2004. Dynamic partitioning of shared cache memory. J. Supercompu. 28, 1, 7--26. Google ScholarDigital Library
- Sugumar, R. and Abraham, S. 1991. Efficient simulation of multiple cache configurations using binomial trees. Tech. rep. CSE-TR-111-91.Google Scholar
- Sugumar, R. A. 1993. Multi-reconfiguration simulation algorithms for the evaluation of computer architecture designs. Ph.D. Thesis, University of Michigan, Ann Arbor, MI. Google ScholarDigital Library
- Tarjan, D., Thoziyoor, S., and Jouppi, N. P. 2006. CACTI 4.0, Hewlett-Packard Laboratories Technical Report # HPL-2006-86.Google Scholar
- Thompson, J. G., and Smith, A. J. 1989. Efficient (stack) algorithms for analysis of write-back and sector memories. ACM Transactions on Computer Systems, 7, 1, 78--117. Google ScholarDigital Library
- Ishihara, T. and Fallah, F. 2005. A non-uniform cache architecture for low power system design. IN Proceedings of the International Symposium on Low Power Electronics and Design. ACM, New York, NY, 363--368. Google ScholarDigital Library
- Uhlig, R. A. and Mudge, T.N. 1997. Trace-driven memory simulation: A survey. ACM Comput. Surv. 29, 2, 128--170. Google ScholarDigital Library
- Varadarajan, K., Nandy, S., Sharda, V., Bharadwaj, A., Iyer, R., Makineni, S., and Newell, D. 2006. Molecular caches: A caching structure for dynamic creation of application-specific heterogeneous cache regions. In Proceedings of the (MICRO), IEEE, Los Alamitos, CA, 433--442. Google ScholarDigital Library
- Veidenbaum, A., Tang, W., Gupta, R., Nicolau, A., and Ji. X. 1999. Adapting cache line size to application behavior. In Proceedings of the International Conference on Supercomputing. ACM, New York, NY, 145--154. Google ScholarDigital Library
- Venkatachalam, V. and Franz, M. 2005. Power reduction techniques for microprocessor systems. ACM Comput. Surv. 37, 3, 195--237. Google ScholarDigital Library
- Vera, X., Bermudo, N., Llosa, J., and Gonzalez, A. 2004. A fast and accurate framework to analyze and optimize cache memory behavior. ACM Trans. Program. Lang. Syst. 26, 2, 263--300. Google ScholarDigital Library
- Viana, P., Gordon-Ross, A., Keogh, E., Barros, E., and Vahid, F. 2006. Configurable cache subsetting for fast cache tuning. In Proceedings of the ACM Design Automation Conference. ACM, New York, NY, 695--900. Google ScholarDigital Library
- Viana, P., Gordon-Ross, A., Baros, E., and Vahid, F. 2008. A table-based method for single-Pass cache optimization. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, New York, NY, 71--76. Google ScholarDigital Library
- Vivekanandarajah, K., Sirkanthan, T., and Clarke, C. T. 2006. Profile directed instruction cache tuning for embedded systems. In Proceedings of the IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures. IEEE, Washington, DC, 227. Google ScholarDigital Library
- Wan, H., Gao, X., Long, X., and Wang, Z. 2009. GCSim: A GPU-based trace-driven simulator for multi-level cache. Advan. Parallel Process. Technol. 177--190. Google ScholarDigital Library
- Wenisch, T. F., Wunderlich, R. E., Ferdman, M., Ailamaki, A., Falsafi, B., and Hoe, J. C. 2006. SimFlex:Statistical sampling of computer system simulation. IEEE Micro 26, 4, 18--31. Google ScholarDigital Library
- Witchell, E. and Rosenblum, M. 1996. Embra: Fast and flexible machine simulation. In Proceedings of the ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. ACM, New York, NY, 68--79. Google ScholarDigital Library
- Wunderlich, R. E., Wenisch, T. F., Falsafi, B. and Hoe, J. C. 2003. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the Annual International Symposium on Computer Architecture (ISCA), IEEE, Washington, DC, 84--95. Google ScholarDigital Library
- Xiang, X., Bao, B., Bai, T., Ding, C., and Chilimbi, T. 2011. All-window profiling and composable models of cache sharing. In Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming. ACM New York, NY, 91--102. Google ScholarDigital Library
- Xie, Y. and Loh, G. 2009. PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches. ACM SIGARCH Comput. Architec. News 37, 3, ACM New York, NY, 174--183. Google ScholarDigital Library
- Xu, C., Chen, X. Dick, R. P., and Mao, Z. M. 2010. Cache contention and application performance prediction for multi-core systems. In Proceedings of the IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS). IEEE, Piscataway, New Jersey, 76--86.Google Scholar
- Yeh, T. and Reinman, G. 2005. Fast and fair: Data-stream quality of service. In Proceedings of the International Conference on Compilers, Architectures and Synthesis for Embedded Systems (CASES). ACM New York, NY, 237--248. Google ScholarDigital Library
- Yourst, M. T. 2007. PTLsim: A cycle accurate full system x86-64 microarchitectural simulator. In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS), IEEE, Piscataway, NJ, 23--34.Google ScholarCross Ref
- Zang, W. and Gordon-Ross, A. 2011. T-SPaCS - a two-level single-pass cache simulation methodology. In Proceedings of the 16th Asia and South Pacific Design Automation Conference. IEEE, Piscataway, NJ, 419--424. Google ScholarDigital Library
- Zhang, W., Hu, J. S., Degalahal, V., Kandemir, M., Vijaykrishnan, N., and Irwin, M. J. 2002. Compiler-directed instruction cache leakage optimization. In Proceedings of the 35th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-35). IEEE, Los Alamitos, CA, 208--218. Google ScholarDigital Library
- Zhang, Y., Parikh, D., Sankaranarayanan, K. Skadron, K., and Stan, M. 2003. HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects. Tech. rep. CS-2003-05, Department of Computer Science, University of Virginia, Charlottesville, VA.Google Scholar
- Zhang, C., Vahid, F., and Lysecky, R. 2004. A self-tuning cache architecture for embedded systems. Special issue on Dynamically Adaptable Embedded System. ACM Trans. Embed. Comput. Syst. 3, 2, 1--19. Google ScholarDigital Library
- Zhou, H., Toburen, M. C., Rotenberg, E., and Conte, T. 2003. Adaptive mode control: A static-power-efficient cache design. ACM Trans. Embed. Comput. Syst. 2, 3, 347--372. Google ScholarDigital Library
- Zhong, Y., Dropsho, S., and Ding, C. 2003. Miss rate prediction across all program inputs. In Proceedings of the 12th International Conference on Parallel Architectures and Compilation Techniques. IEEE, Washington, DC, 91--101. Google ScholarDigital Library