DOI: 10.1145/2491894.2464160
Research article

Rigorous benchmarking in reasonable time

Published: 20 June 2013

Abstract

Experimental evaluation is key to systems research. Because modern systems are complex and non-deterministic, good experimental methodology demands that researchers account for uncertainty. To obtain valid results, they are expected to run many iterations of benchmarks, invoke virtual machines (VMs) several times, or even rebuild VM or benchmark binaries more than once. All this repetition costs time to complete experiments. Currently, many evaluations give up on sufficient repetition or rigorous statistical methods, or even run benchmarks only with training-size inputs. The results reported often lack proper variation estimates and, when a small difference between two systems is reported, some of those results are simply unreliable.
In contrast, we provide a statistically rigorous methodology for repetition and for summarising results that makes efficient use of experimentation time. Time efficiency comes from two key observations. First, a given benchmark on a given platform is typically prone to much less non-determinism than the worst cases reported in published corner-case studies would suggest. Second, repetition is most needed where most uncertainty arises (whether between builds, between executions or between iterations). We capture experimentation cost with a novel mathematical model, which we use to identify the number of repetitions, at each level of an experiment, that is necessary and sufficient to obtain a given level of precision.
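The paper's cost model itself is not reproduced on this page, but the flavour of the repetition-count calculation can be sketched. The snippet below is a hypothetical illustration rather than the authors' code: for a two-level experiment in which VM executions contain measurement iterations, the classic optimal-allocation result for nested designs suggests how many iterations to run per execution, given rough estimates of the variance and the repetition cost at each level. The function name and all numbers are assumptions made for the example.

```python
# Hypothetical sketch, not the authors' implementation: pick the number of
# measurement iterations per VM execution in a two-level experiment, using
# the classic optimal-allocation result for nested (two-stage) designs.
# Inputs are assumed estimates from a small initial (dimensioning) experiment:
#   var_iter  - variance between iterations within one execution
#   var_exec  - between-execution variance component (excluding iteration noise)
#   cost_iter - time of one measurement iteration
#   cost_exec - fixed overhead of one execution (VM start-up, warm-up, ...)
import math

def iterations_per_execution(var_iter, var_exec, cost_iter, cost_exec):
    """Suggested iterations per execution: repeat most at the level where
    the variance is largest relative to the cost of repeating there."""
    if var_exec <= 0:
        # effectively no between-execution variance: one long execution suffices
        return float("inf")
    return max(1, math.ceil(math.sqrt((cost_exec * var_iter) /
                                      (cost_iter * var_exec))))

# Example with made-up numbers: cheap iterations (0.5 s), expensive executions
# (30 s of start-up and warm-up), and most variance between iterations.
print(iterations_per_execution(var_iter=4.0, var_exec=1.0,
                               cost_iter=0.5, cost_exec=30.0))  # -> 16
```

The paper extends this kind of cost/variance trade-off across builds, executions and iterations; the sketch only conveys the two-level intuition, with variance and cost estimates assumed to come from an initial dimensioning experiment.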
We present our methodology as a cookbook that guides researchers on the number of repetitions they should run to obtain reliable results. We also show how to present results with an effect size confidence interval. As an example, we show how to use our methodology to conduct throughput experiments with the DaCapo and SPEC CPU benchmarks on three recent platforms.
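To illustrate what reporting an effect-size confidence interval can look like in practice, the sketch below computes a percentile-bootstrap interval for the ratio of mean execution times of an "old" and a "new" system, resampling at both the execution and the iteration level. This is an assumption-laden stand-in, not the paper's derivation or data: the interval construction in the paper is derived formally and differs from this simple resampling scheme, and the timings used here are invented placeholders.

```python
# Hypothetical sketch, not the paper's method: a percentile-bootstrap
# confidence interval for the ratio of mean execution times of an "old" and
# a "new" system, resampling executions and then iterations within them.
import random
import statistics

def hierarchical_mean(runs):
    """Mean over executions of the per-execution mean iteration time."""
    return statistics.mean(statistics.mean(r) for r in runs)

def resample(runs):
    """Resample executions with replacement, then iterations within each."""
    return [random.choices(run, k=len(run))
            for run in random.choices(runs, k=len(runs))]

def bootstrap_ratio_ci(runs_old, runs_new, reps=10000, alpha=0.05):
    """Approximate (1 - alpha) interval for mean(new) / mean(old); a value
    below 1 means the new system is faster."""
    ratios = sorted(hierarchical_mean(resample(runs_new)) /
                    hierarchical_mean(resample(runs_old))
                    for _ in range(reps))
    return (ratios[int(alpha / 2 * reps)],
            ratios[int((1 - alpha / 2) * reps) - 1])

# Invented placeholder timings: 3 executions x 5 iterations per system.
old = [[10.2, 10.1, 10.3, 10.2, 10.4], [10.5, 10.6, 10.4, 10.5, 10.7],
       [10.1, 10.0, 10.2, 10.1, 10.3]]
new = [[9.9, 9.8, 10.0, 9.9, 10.1], [10.2, 10.3, 10.1, 10.2, 10.4],
       [9.8, 9.7, 9.9, 9.8, 10.0]]
print(bootstrap_ratio_ci(old, new))  # e.g. an interval around ~0.97
```

If the resulting interval excludes 1, the measured difference is distinguishable from noise at the chosen confidence level; if it straddles 1, a small reported speed-up should not be trusted.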

Supplementary Material

p63-kalibera original submission (p63-kalibera.pdf)
Originally published version of "Rigorous benchmarking in reasonable time"




      Published In

      ISMM '13: Proceedings of the 2013 International Symposium on Memory Management
      June 2013
      140 pages
      ISBN:9781450321006
      DOI:10.1145/2491894


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Author Tags

      1. benchmarking methodology
      2. DaCapo
      3. SPEC CPU
      4. statistical methods

      Qualifiers

      • Research-article

      Conference

      ISMM '13: International Symposium on Memory Management
      June 20 - 21, 2013
      Seattle, Washington, USA

      Acceptance Rates

      ISMM '13 paper acceptance rate: 11 of 22 submissions (50%)
      Overall acceptance rate: 72 of 156 submissions (46%)


      Cited By

      • (2024) An Empirical Study on Code Coverage of Performance Testing. Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 48-57. DOI: 10.1145/3661167.3661196. Published online: 18 June 2024.
      • (2024) Evaluating Search-Based Software Microbenchmark Prioritization. IEEE Transactions on Software Engineering, pages 1-16. DOI: 10.1109/TSE.2024.3380836. Published online: 2024.
      • (2023) A Study of Java Microbenchmark Tail Latencies. Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, pages 77-81. DOI: 10.1145/3578245.3584690. Published online: 15 April 2023.
      • (2023) Faster or Slower? Performance Mystery of Python Idioms Unveiled with Empirical Evidence. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1495-1507. DOI: 10.1109/ICSE48619.2023.00130. Published online: May 2023.
      • (2023) Towards effective assessment of steady state performance in Java software: are we there yet? Empirical Software Engineering, 28(1). DOI: 10.1007/s10664-022-10247-x. Published online: 1 January 2023.
      • (2021) How Software Refactoring Impacts Execution Time. ACM Transactions on Software Engineering and Methodology, 31(2), pages 1-23. DOI: 10.1145/3485136. Published online: 24 December 2021.
      • (2021) Would You Like a Quick Peek? Providing Logging Support to Monitor Data Processing in Big Data Applications. Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 516-526. DOI: 10.1145/3468264.3468613. Published online: 20 August 2021.
      • (2021) KernelFaRer. ACM Transactions on Architecture and Code Optimization, 18(3), pages 1-22. DOI: 10.1145/3459010. Published online: 28 June 2021.
      • (2021) What's Wrong with My Benchmark Results? Studying Bad Practices in JMH Benchmarks. IEEE Transactions on Software Engineering, 47(7), pages 1452-1467. DOI: 10.1109/TSE.2019.2925345. Published online: 1 July 2021.
      • (2021) SPARC: Statistical Performance Analysis With Relevance Conclusions. IEEE Open Journal of the Computer Society, 2, pages 117-129. DOI: 10.1109/OJCS.2021.3060658. Published online: 2021.
