research-article

Effective performance measurement and analysis of multithreaded applications

Published: 14 February 2009

Abstract

Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead, namely, when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCToolkit performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
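The way idleness and overhead metrics point to a remedy can be illustrated with a minimal sketch. It is not the paper's implementation or HPCToolkit's API: the `ContextMetrics` record, the `diagnose` function, and the 20% threshold for "high" are all hypothetical, chosen only to make the three-way decision described in the abstract concrete.

```python
from dataclasses import dataclass


@dataclass
class ContextMetrics:
    """Hypothetical per-context sample counts (not an HPCToolkit data structure)."""
    name: str
    work: float      # samples spent executing the user's computation
    idleness: float  # samples attributed to threads stalled and unable to work
    overhead: float  # samples spent on other work, e.g. scheduling or synchronization


def diagnose(m: ContextMetrics, threshold: float = 0.2) -> str:
    """Map idleness and overhead fractions to the abstract's three remedies.

    The threshold for "high" is an assumption made for illustration.
    """
    total = m.work + m.idleness + m.overhead
    if total == 0:
        return "no samples"
    idle_high = m.idleness / total > threshold
    overhead_high = m.overhead / total > threshold
    if idle_high and overhead_high:
        return "rethink parallelization"
    if idle_high:
        return "increase concurrency"
    if overhead_high:
        return "decrease concurrency"
    return "ok"


if __name__ == "__main__":
    for ctx in (
        ContextMetrics("loop_a", work=50, idleness=40, overhead=10),
        ContextMetrics("loop_b", work=50, idleness=5, overhead=45),
        ContextMetrics("loop_c", work=20, idleness=40, overhead=40),
    ):
        print(ctx.name, "->", diagnose(ctx))
```

In practice the cutoff would depend on the application and the machine; the point of the sketch is only that the two metrics together, attributed to the same program context, distinguish too little parallelism from too fine-grained parallelism.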

Published In

ACM SIGPLAN Notices, Volume 44, Issue 4 (PPoPP '09)
April 2009, 294 pages
ISSN: 0362-1340
EISSN: 1558-1160
DOI: 10.1145/1594835

PPoPP '09: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming
February 2009, 322 pages
ISBN: 9781605583976
DOI: 10.1145/1504176
Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. call path profiling
2. hpctoolkit
3. multithreaded programming models
4. performance analysis
