Abstract
Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, this article proposes a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. This article describes two different implementations of informing memory operations. One is based on a cache-outcome condition code, and the other is based on low-overhead traps. We find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and we look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.
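The core idea, a memory reference that transfers control to a handler only when it misses in the cache, can be illustrated with a toy software model. The sketch below is not the article's hardware design: the `DirectMappedCache` class, the `informing_load` function, the cache geometry, and the `count_misses` helper are all hypothetical names and parameters chosen for illustration. The callback stands in for the branch-and-link target.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks tags only (no data).

    The geometry (4 lines of 16 bytes) is an arbitrary assumption.
    """

    def __init__(self, num_lines=4, line_bytes=16):
        self.num_lines = num_lines
        self.line_bytes = line_bytes
        self.tags = [None] * num_lines

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line and return False."""
        block = addr // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag  # a miss replaces whatever was in the line
        return False


def informing_load(cache, memory, addr, miss_handler):
    """Hypothetical model of an informing load.

    The load always returns its value. The handler runs only if the
    reference misses, mimicking a branch-and-link taken on a miss.
    """
    if not cache.access(addr):
        miss_handler(addr)
    return memory[addr]


def count_misses(addresses):
    """Use the miss handler as a simple profiler that records miss addresses."""
    cache = DirectMappedCache()
    memory = {a: a * 2 for a in addresses}  # dummy memory contents
    missed = []
    for a in addresses:
        informing_load(cache, memory, a, missed.append)
    return missed
```

Here the handler only records the missing address, which corresponds to the performance-monitoring use of the mechanism. Under the same assumptions, a handler could instead issue prefetches or run a coherence check, the kinds of software optimizations the abstract mentions.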
Index Terms
Informing memory operations: memory performance feedback mechanisms and their applications