Evaluating the locality benefits of active messages

ABSTRACT
A major challenge in fine-grained computing is achieving locality without excessive scheduling overhead. We built two J-Machine implementations of a fine-grained programming model, the Berkeley Threaded Abstract Machine (TAM). One implementation takes an Active Messages approach, maintaining a scheduling hierarchy in software to improve data cache performance. The other relies on the J-Machine's hardware message queues and fast task switching, lowering control costs at the expense of data locality. Our analysis measures the costs and benefits of each approach across a variety of programs and cache configurations. The Active Messages implementation is strongest when miss penalties are high and for the finest-grained programs; the hardware-buffered implementation is strongest with direct-mapped caches, where it achieves substantially better instruction cache performance.
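The trade-off between the two implementations can be illustrated with a toy scheduler sketch. This is a hypothetical model, not the paper's actual TAM runtime: the hardware-buffered style dispatches messages in FIFO arrival order, while the Active Messages style maintains a software scheduling structure that drains all pending messages for one activation frame before moving on, so consecutive tasks touch the same frame data.

```python
from collections import deque, defaultdict

# Each message is (frame, handler). A data cache favors runs of tasks
# that touch the same activation frame. Names here are illustrative.

def fifo_schedule(messages):
    """Hardware-buffered style: dispatch in arrival order."""
    q = deque(messages)
    order = []
    while q:
        order.append(q.popleft())
    return order

def frame_grouped_schedule(messages):
    """Active Messages style: a software scheduling structure that
    drains all pending messages for one frame before moving on."""
    by_frame = defaultdict(list)
    arrival = []                      # first-seen frame order
    for frame, handler in messages:
        if frame not in by_frame:
            arrival.append(frame)
        by_frame[frame].append((frame, handler))
    order = []
    for frame in arrival:
        order.extend(by_frame[frame])
    return order

def frame_switches(order):
    """Count adjacent tasks that change frames -- a rough proxy for
    data cache misses caused by switching working sets."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])

msgs = [("f1", "t0"), ("f2", "t0"), ("f1", "t1"), ("f2", "t1"), ("f1", "t2")]
print(frame_switches(fifo_schedule(msgs)))           # 4 frame switches
print(frame_switches(frame_grouped_schedule(msgs)))  # 1 frame switch
```

The grouped schedule trades the bookkeeping of the software hierarchy (the control cost) for fewer working-set changes (the locality benefit), which is exactly the tension the abstract describes.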