Abstract
Efficient execution of well-parallelized applications is central to performance in the multicore era. Program analysis tools support the hardware and software sides of this effort by exposing relevant features of multithreaded applications. This paper describes parallel block vectors, which uncover previously unseen characteristics of parallel programs. Parallel block vectors provide block execution profiles per concurrency phase (e.g., the block execution profile of all serial regions of a program). This information provides a direct and fine-grained mapping between an application's runtime parallel phases and the static code that makes up those phases. This paper also demonstrates how to collect parallel block vectors with minimal application perturbation using Harmony, an instrumentation pass for the LLVM compiler that introduces only 16-21% overhead on average across eight PARSEC benchmarks.
We apply parallel block vectors to uncover several novel insights about parallel applications with direct consequences for architectural design. First, the serial and parallel phases of execution used in Amdahl's Law are often composed of many of the same basic blocks. Second, program features, such as instruction mix, vary with the degree of parallelism; serial phases in particular display different instruction mixes from the program as a whole. Third, a block's dynamic execution frequency does not necessarily correlate with its degree of parallelism.
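The core idea described above — a per-basic-block vector of execution counts indexed by the number of concurrently running threads — can be sketched in a few lines. The class and method names below are illustrative assumptions, not the paper's actual Harmony implementation; a real collector would be compiler-inserted instrumentation rather than explicit calls.

```python
import threading
from collections import defaultdict

class ParallelBlockVectors:
    """Hypothetical sketch: for each static basic block, record how many
    times it executed at each concurrency level (1 thread, 2 threads, ...).
    The column for concurrency level 1 is then the block execution profile
    of the program's serial phases."""

    def __init__(self, max_threads):
        self.max_threads = max_threads
        self.active = 0                 # threads currently running
        self.lock = threading.Lock()
        # block id -> [count at 1 thread, at 2 threads, ..., at max_threads]
        self.vectors = defaultdict(lambda: [0] * max_threads)

    def thread_started(self):
        with self.lock:
            self.active += 1

    def thread_finished(self):
        with self.lock:
            self.active -= 1

    def block_executed(self, block_id):
        # Instrumentation hook, conceptually invoked at every basic block
        # entry: attribute this execution to the current concurrency level.
        with self.lock:
            self.vectors[block_id][self.active - 1] += 1

    def serial_profile(self):
        # Blocks (and counts) executed while only one thread was running,
        # i.e. the serial-phase profile used in Amdahl's Law analyses.
        return {b: v[0] for b, v in self.vectors.items() if v[0] > 0}
```

A toy run shows how the same mechanism separates serial from parallel executions of blocks:

```python
pbv = ParallelBlockVectors(max_threads=4)
pbv.thread_started()
pbv.block_executed("entry")      # executes during a serial phase
pbv.thread_started()
pbv.block_executed("loop_body")  # executes with two threads active
pbv.thread_finished()
pbv.thread_finished()
print(pbv.serial_profile())      # {'entry': 1}
```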
Published as: Harmony: collection and analysis of parallel block vectors. In ISCA '12: Proceedings of the 39th Annual International Symposium on Computer Architecture.