research-article

Core monitors: monitoring performance in multicore processors

Authors:
Paul E. West, Florida State University, Tallahassee, FL, USA
Yuval Peress, Florida State University, Tallahassee, FL, USA
Gary S. Tyson, Florida State University, Tallahassee, FL, USA
Sally A. McKee, Cornell University, Ithaca, NY, USA

CF '09: Proceedings of the 6th ACM conference on Computing frontiers, May 2009, Pages 31–40, https://doi.org/10.1145/1531743.1531751

Published: 18 May 2009

ABSTRACT

As we reach the limits of single-core computing, we are promised more and more cores in our systems. Modern architectures include many performance counters per core, but few or no inter-core counters. In fact, performance counters were not designed to be exploited by users, as they now are, but simply as aids for hardware debugging and testing during system creation. As such, they tend to be an afterthought in the design, with no standardization across or within platforms. Nonetheless, given access to these counters, researchers are using them to great advantage [17]. At the same time, evaluating counters for multicore systems has become a complex and resource-consuming task. We propose a Performance Monitoring System consisting of a specialized CPU core designed to allow efficient collection and evaluation of performance data for both static and dynamic optimizations. Our system provides a transparent mechanism to change architectural features dynamically, inform the operating system of process behaviors, and assist in profiling and debugging. For instance, a piece of hardware watching snoop packets can determine when a write-update cache coherence protocol would be helpful or detrimental to the currently running program. Our system is designed to allow the hardware to feed performance statistics back to software, allowing dynamic architectural adjustments at runtime.
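
The snoop-packet example in the abstract can be illustrated with a minimal software sketch. This is not the authors' design: the `Snoop` event format, the function names, and the "useful versus wasted update" heuristic are all assumptions made for illustration. The idea is that a write-update protocol pays off when a value written by one core is read by another core before the line is written again, and wastes bus traffic otherwise.

```python
from collections import namedtuple

# Hypothetical snoop event: which core touched which cache line, and how.
# op is "R" (read) or "W" (write). This format is an assumption.
Snoop = namedtuple("Snoop", ["core", "op", "line"])


def classify_updates(trace):
    """Return (useful, wasted) write counts for a snoop trace.

    A write is 'useful' under write-update if a different core reads the
    line before the next write to it; otherwise the update is 'wasted'.
    """
    pending = {}  # line -> [writer core, consumed-by-other-core flag]
    useful = wasted = 0
    for ev in trace:
        last = pending.get(ev.line)
        if ev.op == "W":
            if last is not None:
                # The previous write to this line is now settled.
                if last[1]:
                    useful += 1
                else:
                    wasted += 1
            pending[ev.line] = [ev.core, False]
        elif last is not None and ev.core != last[0]:
            # A remote read consumes the most recent write.
            last[1] = True
    # Settle writes still pending when the trace ends.
    for _writer, consumed in pending.values():
        if consumed:
            useful += 1
        else:
            wasted += 1
    return useful, wasted


def prefer_write_update(trace):
    """Heuristic: favor write-update when most writes are consumed remotely."""
    useful, wasted = classify_updates(trace)
    return useful > wasted


# Producer-consumer sharing: core 0 writes, core 1 reads, repeatedly.
producer_consumer = [Snoop(0, "W", 0xA0), Snoop(1, "R", 0xA0)] * 3
# Private data: core 0 writes the same line and nobody else reads it.
private_writes = [Snoop(0, "W", 0xB0)] * 4
```

In this sketch, `prefer_write_update(producer_consumer)` favors write-update because every write is consumed by the other core, while `prefer_write_update(private_writes)` does not. A hardware monitor would keep comparable counters per line or per region and feed the decision back to the coherence logic or to software.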

References

S. B. Pentium 4 performance-monitoring features. IEEE Micro, 22(4):72–82, Jul/Aug 2002.
W. Binder. Portable and accurate sampling profiling for Java. Softw. Pract. Exper., 36(6):615–650, 2006.
N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The M5 simulator: Modeling networked systems. IEEE Micro, 26(4):52–60, 2006.
K. Chow and Y. Wu. Feedback-directed selection and characterization of compiler optimizations. 2nd Workshop on Feedback Directed Optimization, 1999.
Compaq. Alpha architecture handbook. White paper, October 1998.
J. Dean, J. Hicks, C. Waldspurger, W. Weihl, and G. Chrysos. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In Proc. IEEE/ACM 30th International Symposium on Microarchitecture, pages 292–302, Dec. 1997.
J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, pages 292–302, Washington, DC, USA, 1997. IEEE Computer Society.
G. Delzanno. Automatic verification of parameterized cache coherence protocols. In Computer Aided Verification, pages 53–68, Dec. 2006.
B. Fields, R. Bodik, M. Hill, and C. Newburn. Using interaction costs for microarchitectural bottleneck analysis. In Proc. IEEE/ACM 36th International Symposium on Microarchitecture, pages 228–239, Dec. 2003.
H. Grahn and P. Stenstrom. Evaluation of a competitive-update cache coherence protocol with migratory data detection. J. Parallel Distrib. Comput., 39(2):168–180, 1996.
T. Heil and J. E. Smith. Relational profiling: Enable thread-level parallelism in virtual machines. Microarchitecture, IEEE/ACM International Symposium on, 0:281, 2000.
M. Helms, T. Bochner, R. Fritz, T. Schlipf, and M. Walz. Event monitoring in a system-on-a-chip. In Proc. 12th Annual IEEE International ASIC/SOC Conference, Sept. 1999.
R. Hockauf, J. Jeitner, W. Karl, R. Lindhof, M. Schulz, V. Gonzales, E. Sanquis, and G. Torralba. Design and implementation aspects for the SMiLE hardware monitor. In G. Horn and W. Karl, editors, Proc. of SCI-Europe 2000, The 3rd International Conference on SCI-Based Technology and Research, pages 47–55. SINTEF Electronics and Cybernetics, Aug. 2000. ISBN: 82-595-9964-3. Also available at http://wwwbode.in.tum.de/events/.
Intel. Intel Itanium Architecture Software Developer's Manual, 2000.
Intel. Intel Architecture Software Developer's Manual Volume 3: System Programming Guide, 2002.
W. Karl, M. Leberecht, and M. Schulz. Optimizing data locality for SCI-based PC-clusters with the SMiLE monitoring approach. In Proc. of International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 169–176, Oct. 1999.
M. Martonosi, D. W. Clark, and M. Mesarina. The SHRIMP performance monitor: Design and applications. In ACM SIGMETRICS Performance Evaluation Review, pages 61–69, May 1996.
M. Martonosi, D. Ofelt, and M. Heinrich. Integrating performance monitoring and communication in parallel computers. In Proc. ACM International Conference on Measurement and Modeling of Computer Systems, pages 138–147, May 1996.
T. Mu, J. Tao, M. Schulz, and S. McKee. Interactive locality optimization on NUMA architectures. In Proc. ACM 2003 Symposium on Software Visualization (SoftVis), pages 133–142, 214, July 2003.
A. Nanda, K. Mak, K. Sugavanam, R. Sahoo, V. Soundararajan, and T. Smith. MemorIES: A programmable, real-time hardware emulation tool for multiprocessor server design. SIGPLAN Not., 35(11):37–48, 2000.
M. Prvulovic and J. Torrellas. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Proc. 30th IEEE/ACM International Symposium on Computer Architecture, pages 110–121, June 2003.
V. Salapura. Bluegene/p performance counters. Personal Communication: Paper in Submission, Nov. 2007.
V. Salapura, K. Ganesan, A. Gara, M. Gschwind, J. Sexton, and R. Walkup. Next-generation performance counters: Towards monitoring over thousand concurrent events. Performance Analysis of Systems and Software, 2008. ISPASS 2008. IEEE International Symposium on, pages 139–146, April 2008.
S. Sarangi, A. Tiwari, and J. Torrellas. Phoenix: Detecting and recovering from permanent processor design bugs with programmable hardware. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 26–37, Dec. 2006.
S. Sastry, R. Bodík, and J. Smith. Rapid profiling via stratified sampling. In Proc. 28th IEEE/ACM International Symposium on Computer Architecture, pages 278–289, July 2001.
M. Schulz, B. White, S. McKee, H. Lee, and J. Jeitner. Owl: Next generation system monitoring. In Proc. ACM Computing Frontiers Conference, May 2005.
B. Sprunt. The basics of performance-monitoring hardware. IEEE Micro, pages 64–71, July/August 2002.
B. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, pages 72–82, July/August 2002.
M. Xu, R. Bodik, and M. Hill. A flight data recorder for enabling full-system multiprocessor deterministic replay. In Proc. 30th IEEE/ACM International Symposium on Computer Architecture, pages 122–135, June 2003.
P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In Proc. 11th ACM Symposium on Architectural Support for Programming Languages and Operating Systems, pages 177–188, Oct. 2004.
P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iWatcher: Efficient architectural support for software debugging. Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on, pages 224–235, June 2004.
P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iWatcher: Efficient architectural support for software debugging. In Proc. 31st IEEE/ACM International Symposium on Computer Architecture, pages 224–237, June 2004.

Index Terms

Core monitors: monitoring performance in multicore processors
1. Computer systems organization
  1. Architectures
    1. Other architectures
2. Hardware
  1. Hardware validation

Published in
CF '09: Proceedings of the 6th ACM conference on Computing frontiers
May 2009
238 pages
ISBN: 9781605584133
DOI: 10.1145/1531743
General Chairs:
Gearold Johnson, Colorado State University, USA
Carsten Trinitis, TU München, Germany
Program Chairs:
Georgi N. Gaydadjiev, TU Delft, The Netherlands
Alex Veidenbaum, University of California, USA
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cache coherency
debugging
multicore
performance monitoring
profiling
realtime
scheduling
Acceptance Rates
CF '09 Paper Acceptance Rate: 26 of 113 submissions, 23%
Overall Acceptance Rate: 240 of 680 submissions, 35%
Article Metrics
- Total Citations: 3
- Total Downloads: 565
- Downloads (Last 12 months): 14
- Downloads (Last 6 weeks): 0