DOI: 10.1145/2555243.2555271
research-article

A tool to analyze the performance of multithreaded programs on NUMA architectures

Published: 06 February 2014

ABSTRACT

Almost all of today's microprocessors contain memory controllers and directly attach to memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is faster for a microprocessor to access memory that is directly attached than it is to access memory attached to another processor. Without careful distribution of computation and data, a multithreaded program running on such a system may have high average memory access latency. To use multiprocessor systems efficiently, programmers need performance tools to guide the design of NUMA-aware codes. To address this need, we enhanced the HPCToolkit performance tools to support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains. With these extensions, HPCToolkit helps pinpoint, quantify, and analyze NUMA bottlenecks in executions of multithreaded programs. It computes derived metrics to assess the severity of bottlenecks, analyzes memory accesses, and provides a wealth of information to guide NUMA optimization, including information about how to distribute data to reduce access latency and minimize contention. This paper describes the design and implementation of our extensions to HPCToolkit. We demonstrate their utility by describing case studies in which we use these capabilities to diagnose NUMA bottlenecks in four multithreaded applications.

            • Published in

              PPoPP '14: Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming
              February 2014
              412 pages
              ISBN: 9781450326568
              DOI: 10.1145/2555243

              Copyright © 2014 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Acceptance Rates

              PPoPP '14 paper acceptance rate: 28 of 184 submissions (15%). Overall acceptance rate: 230 of 1,014 submissions (23%).
