Informing memory operations: memory performance feedback mechanisms and their applications

Published: 01 May 1998

Abstract

Memory latency is an important bottleneck in system performance that cannot be adequately solved by hardware alone. Several promising software techniques have been shown to address this problem successfully in specific situations. However, the generality of these software approaches has been limited because current architectures do not provide a fine-grained, low-overhead mechanism for observing and reacting to memory behavior directly. To fill this need, this article proposes a new class of memory operations called informing memory operations, which essentially consist of a memory operation combined (either implicitly or explicitly) with a conditional branch-and-link operation that is taken only if the reference suffers a cache miss. This article describes two different implementations of informing memory operations. One is based on a cache-outcome condition code, and the other is based on low-overhead traps. We find that modern in-order-issue and out-of-order-issue superscalar processors already contain the bulk of the necessary hardware support. We describe how a number of software-based memory optimizations can exploit informing memory operations to enhance performance, and we look at cache coherence with fine-grained access control as a case study. Our performance results demonstrate that the runtime overhead of invoking the informing mechanism on the Alpha 21164 and MIPS R10000 processors is generally small enough to provide considerable flexibility to hardware and software designers, and that the cache coherence application has improved performance compared to other current solutions. We believe that the inclusion of informing memory operations in future processors may spur even more innovative performance optimizations.
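The semantics are easy to picture in software. The following is a minimal C sketch, assuming a software-simulated cache in place of the real hardware signal: `simulated_miss`, `miss_handler`, and `informing_load` are illustrative names, not the paper's ISA; a real implementation would deliver the miss outcome through the cache-outcome condition code or a low-overhead trap described above.

```c
#include <stdint.h>
#include <stdio.h>

/* Stand-in for the hardware's cache-outcome signal: a tiny direct-mapped
 * tag array over 64-byte lines (purely illustrative, not real hardware). */
static int simulated_miss(const void *addr) {
    static uintptr_t tags[256];
    uintptr_t line = (uintptr_t)addr >> 6;   /* 64-byte cache lines */
    uintptr_t *slot = &tags[line & 255];     /* 256 direct-mapped sets */
    int miss = (*slot != line);
    *slot = line;                            /* fill the line on a miss */
    return miss;
}

/* Branch-and-link target: executed only when the reference misses.
 * A real handler might count misses, profile, or issue a prefetch. */
static void miss_handler(const void *addr) {
    fprintf(stderr, "cache miss at %p\n", addr);
}

/* An informing load: an ordinary load combined with a conditional
 * branch-and-link that is taken only if the reference misses. */
static uint64_t informing_load(const uint64_t *addr) {
    uint64_t value = *addr;           /* the memory operation itself */
    if (simulated_miss(addr))         /* hardware: condition code or trap */
        miss_handler(addr);           /* software observes and reacts */
    return value;
}

int main(void) {
    static uint64_t data[1024];
    for (int i = 0; i < 4; i++)
        informing_load(&data[i * 8]); /* first touch of each line misses */
    return 0;
}
```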




Reviews

Herbert G. Mayer

Memory latency has long been a major bottleneck in system performance, and has been only partly resolved through multilevel caching. This important paper shows two ways that informing memory operations (IMOs), a combination of software techniques and small hardware enhancements, can provide a fine-grained mechanism for observing and reacting to memory operations so that cache misses can be significantly reduced. One method is based on cache-outcome condition codes, and the other is based on low-overhead traps. For both approaches, modern in-order-issue as well as out-of-order-issue superscalar architectures already provide most of the required hardware. IMOs are low-overhead, hardware-supported operations that transfer control to specified target addresses when L1 cache misses occur, with the goal of reducing future misses. IMOs apply to CISC and RISC alike, and to in-order and out-of-order machines; they are fine-grained and low-overhead, but do introduce some perturbation.

After the introduction in section 1, section 2 explains IMOs. Section 3 focuses on implementation methods, which will be important to future computer architects. Section 4 reviews the operations software performs at the target address after a miss: performance monitoring, software-controlled prefetching, enforcing cache coherence in multiprocessor systems, and handling L1 cache misses in such a way that the performance gains outweigh the losses introduced by the new mechanism. Section 5 is the most interesting, with its case studies on IMOs; pseudocode implements and clarifies adaptive prefetching. Section 6 concludes the paper.

The extent of the perturbations introduced by IMOs remains unclear. That is a key question, since those perturbations reduce the advantage gained by the new technique. It also remains unclear whether the corrective action can be exploited without using IMOs, after the program has gathered the fine-grained information in a previous run through IMOs. This would capture the speedup without the interference and perturbation.

The experimental results shown in section 4, quantified through simulation for the MIPS R10000 and Alpha 21164 over the SPEC92 benchmarks, demonstrate the feasibility of the new approach to reducing the von Neumann bottleneck. The results are particularly convincing for floating-point operations on an out-of-order architecture, providing a serendipitous justification for this more complex architecture by demonstrating its superior performance potential.

This paper marks a grand milestone in computer architecture, and I am glad I had a chance to review it. History has shown that memory and bus speeds lag considerably behind processor speeds, so the new method is a welcome solution to an old gap. IMOs may be as important as caching was 30 years ago. It is time for a revolutionary method, and IMOs may just be that method. Everyone in the field must read this paper; we shall hear much more about IMOs in the future.
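To make the adaptive prefetching mentioned above concrete, here is a minimal C sketch, assuming a hypothetical `informing_miss` probe in place of the hardware's miss feedback (condition code or trap); the distance-doubling policy is illustrative only, not the paper's pseudocode.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for informing-load miss feedback; real hardware
 * would report this via the cache-outcome condition code or a miss trap. */
static int informing_miss(const double *p) {
    static uintptr_t last_line;
    uintptr_t line = (uintptr_t)p >> 6;      /* 64-byte line granularity */
    int miss = (line != last_line);          /* crude one-line "cache" */
    last_line = line;
    return miss;
}

/* Adaptive prefetching: widen the prefetch distance while informing
 * loads keep reporting misses, and shrink it gently once they hit. */
double sum_adaptive(const double *a, size_t n) {
    size_t dist = 8;                         /* prefetch distance, in elements */
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + dist < n)
            __builtin_prefetch(&a[i + dist]); /* GCC/Clang prefetch hint */
        sum += a[i];
        if (informing_miss(&a[i])) {
            if (dist < 64)
                dist *= 2;                   /* still missing: fetch earlier */
        } else if (dist > 8) {
            dist -= 1;                       /* hitting: back off slowly */
        }
    }
    return sum;
}
```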
