ABSTRACT
Almost all of today's microprocessors contain memory controllers and directly attach to memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is faster for a microprocessor to access memory that is directly attached than it is to access memory attached to another processor. Without careful distribution of computation and data, a multithreaded program running on such a system may have high average memory access latency. To use multiprocessor systems efficiently, programmers need performance tools to guide the design of NUMA-aware codes. To address this need, we enhanced the HPCToolkit performance tools to support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains. With these extensions, HPCToolkit helps pinpoint, quantify, and analyze NUMA bottlenecks in executions of multithreaded programs. It computes derived metrics to assess the severity of bottlenecks, analyzes memory accesses, and provides a wealth of information to guide NUMA optimization, including information about how to distribute data to reduce access latency and minimize contention. This paper describes the design and implementation of our extensions to HPCToolkit. We demonstrate their utility by describing case studies in which we use these capabilities to diagnose NUMA bottlenecks in four multithreaded applications.