DOI: 10.1145/2491894.2464160
Research article

Rigorous benchmarking in reasonable time

Published: 20 June 2013

Abstract

Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only with training-size inputs. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, some of those results are simply unreliable.
In contrast, we provide a statistically rigorous methodology for repetition and for summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the worst cases reported in published corner-case studies would suggest. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions, at each level of an experiment, that is necessary and sufficient to obtain a given level of precision.
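The paper's cost model itself is not reproduced on this page, but the flavour of the repetition-count calculation can be sketched. The snippet below is a hypothetical illustration rather than the authors' code: for a two-level experiment in which VM executions contain measurement iterations, the classic optimal-allocation result for nested designs suggests how many iterations to run per execution, given rough estimates of the variance and the repetition cost at each level. The function name and all numbers are assumptions made for the example.

```python
# Hypothetical sketch, not the authors' implementation: pick the number of
# measurement iterations per VM execution in a two-level experiment, using
# the classic optimal-allocation result for nested (two-stage) designs.
# Inputs are assumed estimates from a small initial (dimensioning) experiment:
#   var_iter  - variance between iterations within one execution
#   var_exec  - between-execution variance component (excluding iteration noise)
#   cost_iter - time of one measurement iteration
#   cost_exec - fixed overhead of one execution (VM start-up, warm-up, ...)
import math

def iterations_per_execution(var_iter, var_exec, cost_iter, cost_exec):
    """Suggested iterations per execution: repeat most at the level where
    the variance is largest relative to the cost of repeating there."""
    if var_exec <= 0:
        # effectively no between-execution variance: one long execution suffices
        return float("inf")
    return max(1, math.ceil(math.sqrt((cost_exec * var_iter) /
                                      (cost_iter * var_exec))))

# Example with made-up numbers: cheap iterations (0.5 s), expensive executions
# (30 s of start-up and warm-up), and most variance between iterations.
print(iterations_per_execution(var_iter=4.0, var_exec=1.0,
                               cost_iter=0.5, cost_exec=30.0))  # -> 16
```

The paper extends this kind of cost/variance trade-off across builds, executions and iterations; the sketch only conveys the two-level intuition, with variance and cost estimates assumed to come from an initial dimensioning experiment.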
We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
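To illustrate what reporting an effect-size confidence interval can look like in practice, the sketch below computes a percentile-bootstrap interval for the ratio of mean execution times of an "old" and a "new" system, resampling at both the execution and the iteration level. This is an assumption-laden stand-in, not the paper's derivation or data: the interval construction in the paper is derived formally and differs from this simple resampling scheme, and the timings used here are invented placeholders.

```python
# Hypothetical sketch, not the paper's method: a percentile-bootstrap
# confidence interval for the ratio of mean execution times of an "old" and
# a "new" system, resampling executions and then iterations within them.
import random
import statistics

def hierarchical_mean(runs):
    """Mean over executions of the per-execution mean iteration time."""
    return statistics.mean(statistics.mean(r) for r in runs)

def resample(runs):
    """Resample executions with replacement, then iterations within each."""
    return [random.choices(run, k=len(run))
            for run in random.choices(runs, k=len(runs))]

def bootstrap_ratio_ci(runs_old, runs_new, reps=10000, alpha=0.05):
    """Approximate (1 - alpha) interval for mean(new) / mean(old); a value
    below 1 means the new system is faster."""
    ratios = sorted(hierarchical_mean(resample(runs_new)) /
                    hierarchical_mean(resample(runs_old))
                    for _ in range(reps))
    return (ratios[int(alpha / 2 * reps)],
            ratios[int((1 - alpha / 2) * reps) - 1])

# Invented placeholder timings: 3 executions x 5 iterations per system.
old = [[10.2, 10.1, 10.3, 10.2, 10.4], [10.5, 10.6, 10.4, 10.5, 10.7],
       [10.1, 10.0, 10.2, 10.1, 10.3]]
new = [[9.9, 9.8, 10.0, 9.9, 10.1], [10.2, 10.3, 10.1, 10.2, 10.4],
       [9.8, 9.7, 9.9, 9.8, 10.0]]
print(bootstrap_ratio_ci(old, new))  # e.g. an interval around ~0.97
```

If the resulting interval excludes 1, the measured difference is distinguishable from noise at the chosen confidence level; if it straddles 1, a small reported speed-up should not be trusted.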

Supplementary Material

p63-kalibera original submission (p63-kalibera.pdf)
Originally published version of "Rigorous benchmarking in reasonable time"




      Published In

      ISMM '13: Proceedings of the 2013 International Symposium on Memory Management
      June 2013
      140 pages
      ISBN:9781450321006
      DOI:10.1145/2491894


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. benchmarking methodology
      2. DaCapo
      3. SPEC CPU
      4. statistical methods

      Qualifiers

      • Research-article

      Conference

      ISMM '13: International Symposium on Memory Management
      June 20 - 21, 2013
      Seattle, Washington, USA

      Acceptance Rates

      ISMM '13 paper acceptance rate: 11 of 22 submissions (50%)
      Overall acceptance rate: 72 of 156 submissions (46%)


      Cited By

      • (2024) An Empirical Study on Code Coverage of Performance Testing. Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 48-57. DOI: 10.1145/3661167.3661196. Published online: 18 June 2024.
      • (2024) Evaluating Search-Based Software Microbenchmark Prioritization. IEEE Transactions on Software Engineering, pages 1-16. DOI: 10.1109/TSE.2024.3380836. Published online: 2024.
      • (2023) A Study of Java Microbenchmark Tail Latencies. Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, pages 77-81. DOI: 10.1145/3578245.3584690. Published online: 15 April 2023.
      • (2023) Faster or Slower? Performance Mystery of Python Idioms Unveiled with Empirical Evidence. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1495-1507. DOI: 10.1109/ICSE48619.2023.00130. Published online: May 2023.
      • (2023) Towards effective assessment of steady state performance in Java software: are we there yet? Empirical Software Engineering, 28(1). DOI: 10.1007/s10664-022-10247-x. Published online: 1 January 2023.
      • (2021) How Software Refactoring Impacts Execution Time. ACM Transactions on Software Engineering and Methodology, 31(2), pages 1-23. DOI: 10.1145/3485136. Published online: 24 December 2021.
      • (2021) Would You Like a Quick Peek? Providing Logging Support to Monitor Data Processing in Big Data Applications. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 516-526. DOI: 10.1145/3468264.3468613. Published online: 20 August 2021.
      • (2021) KernelFaRer. ACM Transactions on Architecture and Code Optimization, 18(3), pages 1-22. DOI: 10.1145/3459010. Published online: 28 June 2021.
      • (2021) What's Wrong with My Benchmark Results? Studying Bad Practices in JMH Benchmarks. IEEE Transactions on Software Engineering, 47(7), pages 1452-1467. DOI: 10.1109/TSE.2019.2925345. Published online: 1 July 2021.
      • (2021) SPARC: Statistical Performance Analysis With Relevance Conclusions. IEEE Open Journal of the Computer Society, 2, pages 117-129. DOI: 10.1109/OJCS.2021.3060658. Published online: 2021.
