ABSTRACT
This paper describes a new approach to finding performance bottlenecks in shared-memory parallel programs and its embodiment in the Paradyn Parallel Performance Tools running with the Blizzard fine-grain distributed shared memory system. This approach exploits the underlying system's cache coherence protocol to detect data sharing patterns that indicate potential performance bottlenecks and presents performance measurements in a data-centric manner. As a demonstration, Paradyn helped us improve the performance of a new shared-memory application program by a factor of four.