Effective performance measurement and analysis of multithreaded applications

Authors:

Nathan R. Tallent,

John M. Mellor-Crummey

ACM SIGPLAN Notices, Volume 44, Issue 4

Pages 229 - 240

https://doi.org/10.1145/1594835.1504210

Published: 14 February 2009

Abstract

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead -- when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
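
The interpretation rule stated in the abstract (increase concurrency where idleness is high, decrease it where overhead is high, and reconsider the parallelization where both are high) can be sketched as a small classifier. This is an illustration only: the `RegionMetrics` type, the `diagnose` function, the choice to express each metric as a fraction of total effort, and the 0.2 threshold are all assumptions made for this example, not the metrics, cutoffs, or API of the paper or of HPCToolkit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionMetrics:
    """Hypothetical per-region totals, all in the same unit (e.g. samples)."""
    work: float      # effort spent executing the user's computation
    idleness: float  # effort lost while threads were stalled
    overhead: float  # effort spent on miscellaneous non-user work


def diagnose(m: RegionMetrics, threshold: float = 0.2) -> str:
    """Classify a region by its idleness and overhead fractions.

    The threshold is an arbitrary illustrative cutoff (an assumption),
    not a value taken from the paper.
    """
    total = m.work + m.idleness + m.overhead
    if total <= 0:
        return "no data"

    idle_high = m.idleness / total > threshold
    overhead_high = m.overhead / total > threshold

    if idle_high and overhead_high:
        # Both metrics are high: the present parallelization is not paying off.
        return "rethink parallelization"
    if idle_high:
        # Threads are stalled: expose more concurrency.
        return "increase concurrency"
    if overhead_high:
        # Too much miscellaneous work: coarsen the parallelism.
        return "decrease concurrency"
    return "ok"


if __name__ == "__main__":
    print(diagnose(RegionMetrics(work=50, idleness=45, overhead=5)))
```

In practice such a rule would be applied per calling context or per loop rather than to a whole program, so that each recommendation points at a specific area of the application.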



Index Terms

Effective performance measurement and analysis of multithreaded applications
1. General and reference
  1. Cross-computing tools and techniques


Published In

ACM SIGPLAN Notices Volume 44, Issue 4

PPoPP '09

April 2009

294 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1594835

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2009
322 pages
ISBN:9781605583976
DOI:10.1145/1504176
General Chair: Daniel Reed (Microsoft Research, USA)
Program Chair: Vivek Sarkar (Rice University, USA)

Copyright © 2009 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

