ABSTRACT
Almost all of today's microprocessors contain memory controllers and directly attach to memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is faster for a microprocessor to access memory that is directly attached than it is to access memory attached to another processor. Without careful distribution of computation and data, a multithreaded program running on such a system may have high average memory access latency. To use multiprocessor systems efficiently, programmers need performance tools to guide the design of NUMA-aware codes. To address this need, we enhanced the HPCToolkit performance tools to support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains. With these extensions, HPCToolkit helps pinpoint, quantify, and analyze NUMA bottlenecks in executions of multithreaded programs. It computes derived metrics to assess the severity of bottlenecks, analyzes memory accesses, and provides a wealth of information to guide NUMA optimization, including information about how to distribute data to reduce access latency and minimize contention. This paper describes the design and implementation of our extensions to HPCToolkit. We demonstrate their utility by describing case studies in which we use these capabilities to diagnose NUMA bottlenecks in four multithreaded applications.