skip to main content
article
Open access

Interaction cost and shotgun profiling

Published: 01 September 2004 Publication History

Abstract

We observe that the challenges software optimizers and microarchitects face every day boil down to a single problem: bottleneck analysis. A bottleneck is any event or resource that contributes to execution time, such as a critical cache miss or window stall. Tasks such as tuning processors for energy efficiency and finding the right loads to prefetch all require measuring the performance costs of bottlenecks.In the past, simple event counts were enough to find the important bottlenecks. Today, the parallelism of modern processors makes such analysis much more difficult, rendering traditional performance counters less useful. If two microarchitectural events (such as a fetch stall and a cache miss) occur in the same cycle, which event should we blame for the cycle? What cost should we assign to each event? In this paper, we introduce a new model for understanding event costs to facilitate processor design and optimization.First, we observe that all instructions, hardware structures, and events in a machine can interact in only one of two ways (in parallel or serially). We quantify these interactions by defining interaction cost, which can be zero (independent, no interaction), positive (parallel), or negative (serial).Second, we illustrate the value of using interaction costs in processor design and optimization. In a processor with a long pipeline, we show how to mitigate the negative performance effect of long latency "critical" loops, such as the level-one cache access and issue-wakeup, by optimizing seemingly unrelated resources that interact with them.Finally, we propose shotgun profiling, a class of hardware profiling infrastructures that are parallelism-aware, in contrast to traditional event counters. Our recommended design requires only modest extensions to current hardware counters, while enabling the construction of full-featured dependence graphs of the microexecution. With these dependence graphs, many types of analyses can be performed, including identifying critical instructions, finding slack, as well as computing costs and interaction costs.

References

[1]
Anderson, J. M., Berc, L. M., Dean, J., Ghemawat, S., Henzinger, M. R., Leung, S. A., Sites, R. L., Vandevoorde, M. T., Waldspurger, C. A., and Weihl, W. E. 1997. Continuous profiling: Where have all the cycles gone? ACM Trans. Comput. Syst.
[2]
Bahar, R. I. and Manne, S. 2001. Power and energy reduction via pipeline balancing. In 28th International Symposium on Computer Architecture.
[3]
Borch, E., Tune, E., Manne, B., and Emer, J. 2002. Loose loops sink chips. In 8th International Symposium on High-Performance Computer Architecture.
[4]
Boyd, E. L. and Davidson, E. S. 1993. Hierarchical performance modeling with MACS: A case study of the Convex C-240. In 20th International Symposium on Computer Architecture.
[5]
Burger, D. C. and Austin, T. M. 1997. The Simplescalar Tool Set, version 2.0. Tech. Rep., CS-TR-1997--1342, University of Wisconsin, Madison.
[6]
Calder, B., Reinman, G., and Tullsen, D. 1999. Selective value prediction. In 26th International Symposium on Computer Architecture.
[7]
Casmira, J. and Grunwald, D. 2000. Dynamic instruction scheduling slack. In Kool Chips Workshop in Conjunction with MICRO 33.
[8]
Collins, J. D., Wang, H., Tullsen, D. M., Hughes, C., Lee, Y., Lavery, D., and Shen, J. P. 2001. Speculative precomputation: Long-range prefetching of delinquent loads. In 28th International Symposium on Computer Architecture.
[9]
Dean, J., Hicks, J. E., Waldspurger, C. A., Weihl, W. E., and Chrysos, G. 1997. ProfileMe: Hardware support for instruction-level profiling on out-of-order processors. In 30th International Symposium on Microarchitecture.
[10]
Fahs, B., Bose, S., Crum, M., Slechta, B., Spadini, F., Tung, T., Patel, S. J., and Lumetta, S. S. 2001. Performance characterization of a hardware mechanism for dynamic optimization. In 34th International Symposium on Microarchitecture.
[11]
Fields, B., Bodík, R., and Hill, M. D. 2002. Slack: Maximizing performance under technological constraints. In 29th International Symposium on Computer Architecture.
[12]
Fields, B., Rubin, S., and Bodík, R. 2001. Focusing processor policies via critical-path prediction. In 28th International Symposium on Computer Architecture.
[13]
Fisk, B. R. and Bahar, R. I. 1999. The non-critical buffer: Using load latency tolerance to improve data cache efficiency.
[14]
Fleischmann, R. D. et al. 1995. Whole-genome random sequencing and assembly of haemophilus-influenzae. Science 269, 496--512.
[15]
Hartstein, A. and Puzak, T. R. 2002. The optimum pipeline depth for a microprocessor. In 29th International Symposium on Computer Architecture.
[16]
Hennessy, J. L. and Patterson, D. A. 2002. Computer Architecture: A Quantitative Approach, 3 rd ed. Morgan Kaufmann Publishers, Los Altos, CA.
[17]
Hrishikesh, M. S., Jouppi, N. P., Farkas, K. I., Burger, D., Keckler, S. W., and Shivakumar, P. 2002. The optimal logic depth per pipeline stage is 6 to 8 FO4 inverter delays. In 29th International Symposium on Computer Architecture.
[18]
Intel. 2003. Intel Pentium 4 Processor Manual. Available at http://developer.intel.com/design/ pentium4/manuals/.
[19]
Jain, R. 1991. The Art of Computer Systems Performance Analysis. Wiley Professional Computing.
[20]
Karkhanis, T. and Smith, J. E. 2004. A first-order superscalar processor model. In 31st International Symposium on Computer Architecture.
[21]
Lipasti, M. H. and Shen, J. P. 1996. Exceeding the dataflow limit via value prediction. In 29th International Symposium on Microarchitecture.
[22]
Pai, V. S., Ranganathan, P., and Adve, S. V. 1997. The impact of instruction-level parallelism on multiprocessor performance and simulation methodology. In 3rd International Symposium on High Performance Computer Architecture.
[23]
Patel, S., Evers, M., and Patt, Y. 1998. Improving trace cache effectiveness with branch promotion and trace packing. In 25th International Symposium on Computer Architecture.
[24]
Rajwar, R. and Goodman, J. R. 2001. Speculative lock elision: Enabling highly concurrent multithreaded execution. In 34th International Symposium on Microarchitecture.
[25]
Rakvic, R., Black, B., Limaye, D., and Shen, J. P. 2002. Non-vital loads. In 8th International Symposium on High-Performance Computer Architecture.
[26]
Ranganathan, P., Gharachorloo, K., Adve, S. V., and Barroso, L. A. 1998. Performance of database workloads on shared-memory systems with out-of-order processors.
[27]
Rosenblum, M., Bugnion, E., Herrod, S. A., Witchel, E., and Gupta, A. 1995. The impact of architectural trends on operating system performance. In 15th Symposium on Operating Systems Principles.
[28]
Roth, A. and Sohi, G. 2000. Speculative Data-Driven Sequencing for Imperative Programs. Tech. Rep. CS-TR-2000-1411, University of Wisconsin, Madison.
[29]
Sasanka, R., Hughes, C. J., and Adve, S. V. 2002. Joint local and global hardware adaptations for energy. In 10th International Conference on Architectural Support for Programming Languages and Operating Systems.
[30]
Semeraro, G., Magklis, G., Balasubramonian, R., Albonesi, D., Dwarkadas, S., and Scott, M. 2002. Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling. In 8th International Symposium on High-Performance Computer Architecture.
[31]
Seng, J. S., Tune, E. S., and Tullsen, D. M. 2001. Reducing power with dynamic critical path information. In 34th International Symposium on Microarchitecture.
[32]
Sodani, A. and Sohi, G. S. 1997. Dynamic instruction reuse. In 24th International Symposium on Computer Architecture.
[33]
Sprangle, E. and Carmean, D. 2002. Increasing processor performance by implementing deeper pipelines. In 29th International Symposium on Computer Architecture.
[34]
Sprunt, B. 2002. Pentium 4 performance-monitoring features. IEEE Micro, July.
[35]
Srinivasan, S. T., ching Ju, R. D., Lebeck, A. R., and Wilkerson, C. 2001. Locality vs. criticality. In 28th International Symposium on Computer Architecture.
[36]
Srinivasan, S. T. and Lebeck, A. R. 1998. Load latency tolerance in dynamically scheduled processors. In 31st International Symposium on Microarchitecture.
[37]
Steffan, J. G., Colohan, C. B., Zhai, A., and Mowry, T. C. 2000. A scalable approach to thread-level speculation. In 27th International Symposium on Computer Architecture.
[38]
Tune, E., Liang, D., Tullsen, D. M., and Calder, B. 2001. Dynamic prediction of critical path instructions. In 7th International Symposium on High-Performance Computer Architecture.
[39]
Tune, E., Tullsen, D., and Calder, B. 2002. Quantifying instruction criticality. In 11th International Conference on Parallel Architectures and Compilation Techniques.
[40]
Yi, J. J., Lilja, D. J., and Hawkins, D. M. 2003. A statistically rigorous approach for improving simulation methodology. In 9th International Symposium on High Performance Computer Architecture.
[41]
Zagha, M., Larson, B., Turner, S., and Itzkowitz, M. 1996. Performance analysis using the MIPS R10000 performance counters. In Supercomputing '96.
[42]
Zilles, C. and Sohi, G. 2001. Execution-based prediction using speculative slices. In Proceedings of the 28th International Symposium on Computer Architecture.

Cited By

View all
  • (2024)A dependence graph pattern mining method for processor performance analysisPerformance Evaluation10.1016/j.peva.2024.102409164(102409)Online publication date: May-2024
  • (2023)CCS: A Motif-Based Storage Format for Micro-execution Dependence GraphAdvanced Parallel Processing Technologies10.1007/978-981-99-7872-4_8(130-146)Online publication date: 4-Aug-2023
  • (2022)MemGaze: Rapid and Effective Load-Level Memory Trace Analysis2022 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER51413.2022.00058(484-495)Online publication date: Sep-2022
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Architecture and Code Optimization
ACM Transactions on Architecture and Code Optimization  Volume 1, Issue 3
September 2004
121 pages
ISSN:1544-3566
EISSN:1544-3973
DOI:10.1145/1022969
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 September 2004
Published in TACO Volume 1, Issue 3

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Performance analysis
  2. critical path
  3. modeling
  4. profiling

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)82
  • Downloads (Last 6 weeks)13
Reflects downloads up to 05 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2024)A dependence graph pattern mining method for processor performance analysisPerformance Evaluation10.1016/j.peva.2024.102409164(102409)Online publication date: May-2024
  • (2023)CCS: A Motif-Based Storage Format for Micro-execution Dependence GraphAdvanced Parallel Processing Technologies10.1007/978-981-99-7872-4_8(130-146)Online publication date: 4-Aug-2023
  • (2022)MemGaze: Rapid and Effective Load-Level Memory Trace Analysis2022 IEEE International Conference on Cluster Computing (CLUSTER)10.1109/CLUSTER51413.2022.00058(484-495)Online publication date: Sep-2022
  • (2018)RpStacks-MTProceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2018.00054(586-599)Online publication date: 20-Oct-2018
  • (2018)Multi-Stage CPI StacksIEEE Computer Architecture Letters10.1109/LCA.2017.276175117:1(55-58)Online publication date: 1-Jan-2018
  • (2018)Extending the Performance Analysis Tool Box: Multi-stage CPI Stacks and FLOPS Stacks2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)10.1109/ISPASS.2018.00031(179-188)Online publication date: Apr-2018
  • (2017)Dependence Graph Model for Accurate Critical Path Analysis on Out-of-Order ProcessorsJournal of Information Processing10.2197/ipsjjip.25.98325(983-992)Online publication date: 2017
  • (2017)Representative paths analysisProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis10.1145/3126908.3126962(1-12)Online publication date: 12-Nov-2017
  • (2016)Exploiting Core Criticality for Enhanced GPU PerformanceACM SIGMETRICS Performance Evaluation Review10.1145/2964791.290146844:1(351-363)Online publication date: 14-Jun-2016
  • (2016)Exploiting Core Criticality for Enhanced GPU PerformanceProceedings of the 2016 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Science10.1145/2896377.2901468(351-363)Online publication date: 14-Jun-2016
  • Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Login options

Full Access

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media