ABSTRACT
As we reach the limits of single-core computing, we are promised more and more cores in our systems. Modern architectures include many performance counters per core, but few or no inter-core counters. In fact, performance counters were not designed to be exploited by users, as they now are, but simply as aids for hardware debugging and testing during system creation. As such, they tend to be an afterthought in the design, with no standardization across or within platforms. Nonetheless, given access to these counters, researchers are using them to great advantage [17]. At the same time, evaluating counters for multicore systems has become a complex and resource-consuming task. We propose a Performance Monitoring System consisting of a specialized CPU core designed to allow efficient collection and evaluation of performance data for both static and dynamic optimizations. Our system provides a transparent mechanism to change architectural features dynamically, inform the Operating System of process behaviors, and assist in profiling and debugging. For instance, a piece of hardware watching snoop packets can determine whether a write-update cache coherence protocol would be helpful or detrimental to the currently running program. Our system is designed to let the hardware feed performance statistics back to software, enabling dynamic architectural adjustments at runtime.
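The snoop-watching example can be illustrated with a minimal sketch. The abstract does not describe the decision logic, so everything here is an assumption: the event format `(core_id, op, line_addr)`, the names `LineStats` and `advise_protocol`, and the 0.5 threshold are all hypothetical. The heuristic is a simple one: a write counts as "useful" under write-update if some other core reads the line before the next write to it, and "wasted" otherwise.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineStats:
    """Per-cache-line bookkeeping (hypothetical layout, for illustration only)."""
    last_writer: Optional[int] = None
    remote_read_since_write: bool = False
    useful_updates: int = 0  # writes that a remote core read before the next write
    wasted_updates: int = 0  # writes overwritten before any remote core read them


def advise_protocol(events, threshold=0.5):
    """Suggest a coherence policy from an observed snoop trace.

    events: iterable of (core_id, op, line_addr) with op in {"R", "W"}.
    threshold: assumed fraction of useful writes above which write-update
               is suggested; not a value taken from the paper.
    """
    stats = defaultdict(LineStats)
    for core, op, line in events:
        s = stats[line]
        if op == "W":
            if s.last_writer is not None:
                # The previous write is now resolved: was it consumed remotely?
                if s.remote_read_since_write:
                    s.useful_updates += 1
                else:
                    s.wasted_updates += 1
            s.last_writer = core
            s.remote_read_since_write = False
        elif op == "R" and s.last_writer is not None and core != s.last_writer:
            s.remote_read_since_write = True

    useful = sum(s.useful_updates for s in stats.values())
    wasted = sum(s.wasted_updates for s in stats.values())
    total = useful + wasted
    if total == 0:
        return "no-data"
    return "write-update" if useful / total >= threshold else "write-invalidate"


# Producer/consumer sharing: every write is read by another core.
producer_consumer = [(0, "W", 0x40), (1, "R", 0x40)] * 4
# Private data: one core rewrites a line that nobody else reads.
private_writes = [(0, "W", 0x80)] * 5
```

In this sketch a producer/consumer trace favors write-update, while repeated private writes favor write-invalidate, since each update broadcast would go unread. A hardware monitor would work on actual snoop packets and bounded storage rather than an unbounded per-line table.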
REFERENCES
- S. B. Pentium 4 performance-monitoring features. IEEE Micro, 22(4):72--82, Jul/Aug 2002.
- W. Binder. Portable and accurate sampling profiling for Java. Softw. Pract. Exper., 36(6):615--650, 2006.
- N. L. Binkert, R. G. Dreslinski, L. R. Hsu, K. T. Lim, A. G. Saidi, and S. K. Reinhardt. The m5 simulator: Modeling networked systems. IEEE Micro, 26(4):52--60, 2006.
- K. Chow and Y. Wu. Feedback-directed selection and characterization of compiler optimizations. 2nd Workshop on Feedback Directed Optimization, 1999.
- Compaq. Alpha architecture handbook. Whitepaper, October 1998.
- J. Dean, J. Hicks, C. Waldspurger, W. Weihl, and G. Chrysos. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In Proc. IEEE/ACM 30th International Symposium on Microarchitecture, pages 292--302, Dec. 1997.
- J. Dean, J. E. Hicks, C. A. Waldspurger, W. E. Weihl, and G. Chrysos. ProfileMe: hardware support for instruction-level profiling on out-of-order processors. In MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on Microarchitecture, pages 292--302, Washington, DC, USA, 1997. IEEE Computer Society.
- G. Delzanno. Automatic verification of parameterized cache coherence protocols. In Computer Aided Verification, pages 53--68, Dec. 2006.
- B. Fields, R. Bodik, M. Hill, and C. Newburn. Using interaction costs for microarchitectural bottleneck analysis. In Proc. IEEE/ACM 36th International Symposium on Microarchitecture, pages 228--239, Dec. 2003.
- H. Grahn and P. Stenstrom. Evaluation of a competitive-update cache coherence protocol with migratory data detection. J. Parallel Distrib. Comput., 39(2):168--180, 1996.
- T. Heil and J. E. Smith. Relational profiling: Enable thread-level parallelism in virtual machines. Microarchitecture, IEEE/ACM International Symposium on, 0:281, 2000.
- M. Helms, T. Bochner, R. Fritz, T. Schlipf, and M. Walz. Event monitoring in a system-on-a-chip. In Proc. 12th Annual IEEE International ASIC/SOC Conference, Sept. 1999.
- R. Hockauf, J. Jeitner, W. Karl, R. Lindhof, M. Schulz, V. Gonzales, E. Sanquis, and G. Torralba. Design and implementation aspects for the SMiLE hardware monitor. In G. Horn and W. Karl, editors, Proc. of SCI-Europe 2000, The 3rd International Conference on SCI-Based Technology and Research, pages 47--55. SINTEF Electronics and Cybernetics, Aug. 2000. ISBN: 82-595-9964-3, Also available at http://wwwbode.in.tum.de/events/.
- Intel. Intel Itanium Architecture Software Developer's Manual, 2000.
- Intel. Intel Architecture Software Developer's Manual Volume 3: System Programming Guide, 2002.
- W. Karl, M. Leberecht, and M. Schulz. Optimizing data locality for SCI-based PC-clusters with the SMiLE monitoring approach. In Proc. of International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 169--176, Oct. 1999.
- M. Martonosi, D. W. Clark, and M. Mesarina. The SHRIMP performance monitor: Design and applications. In ACM SIGMETRICS Performance Evaluation Review, pages 61--69, May 1996.
- M. Martonosi, D. Ofelt, and M. Heinrich. Integrating performance monitoring and communication in parallel computers. In Proc. ACM International Conference on Measurement and Modeling of Computer Systems, pages 138--147, May 1996.
- T. Mu, J. Tao, M. Schulz, and S. McKee. Interactive locality optimization on NUMA architectures. In Proc. ACM 2003 Symposium on Software Visualization (SoftVis), pages 133--142,214, July 2003.
- A. Nanda, K. Mak, K. Sugavanam, R. Sahoo, V. Soundararajan, and T. Smith. MemorIES: a programmable, real-time hardware emulation tool for multiprocessor server design. SIGPLAN Not., 35(11):37--48, 2000.
- M. Prvulovic and J. Torrellas. Reenact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Proc. 30th IEEE/ACM International Symposium on Computer Architecture, pages 110--121, June 2003.
- V. Salapura. Bluegene/p performance counters. Personal Communication: Paper in Submission, Nov. 2007.
- V. Salapura, K. Ganesan, A. Gara, M. Gschwind, J. Sexton, and R. Walkup. Next-generation performance counters: Towards monitoring over thousand concurrent events. Performance Analysis of Systems and software, 2008. ISPASS 2008. IEEE International Symposium on, pages 139--146, April 2008.
- S. Sarangi, A. Tiwari, and J. Torrellas. Phoenix: Detecting and recovering from permanent processor design bugs with programmable hardware. In Proc. IEEE/ACM 40th Annual International Symposium on Microarchitecture, pages 26--37, Dec. 2006.
- S. Sastry, R. Bodík, and J. Smith. Rapid profiling via stratified sampling. In Proc. 28th IEEE/ACM International Symposium on Computer Architecture, pages 278--289, July 2001.
- M. Schulz, B. White, S. McKee, H. Lee, and J. Jeitner. Owl: Next generation system monitoring. In Proc. ACM Computing Frontiers Conference, May 2005.
- B. Sprunt. The basics of performance-monitoring hardware. IEEE Micro, pages 64--71, July/August 2002.
- B. Sprunt. Pentium 4 performance-monitoring features. IEEE Micro, pages 72--82, July/August 2002.
- M. Xu, R. Bodik, and M. Hill. A flight data recorder for enabling full-system multiprocessor deterministic replay. In Proc. 30th IEEE/ACM International Symposium on Computer Architecture, pages 122--135, June 2003.
- P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou, and S. Kumar. Dynamic tracking of page miss ratio curve for memory management. In Proc. 11th ACM Symposium on Architectural Support for Programming Languages and Operating Systems, pages 177--188, Oct. 2004.
- P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iwatcher: efficient architectural support for software debugging. Computer Architecture, 2004. Proceedings. 31st Annual International Symposium on, pages 224--235, June 2004.
- P. Zhou, F. Qin, W. Liu, Y. Zhou, and J. Torrellas. iWatcher: Efficient architectural support for software debugging. In Proc. 31st IEEE/ACM International Symposium on Computer Architecture, pages 224--237, June 2004.
Index Terms
- Core monitors: monitoring performance in multicore processors