ABSTRACT
Regression testing is a crucial part of software development. It checks that software changes do not break existing functionality. An important assumption of regression testing is that test outcomes are deterministic: an unmodified test is expected to either always pass or always fail for the same code under test. Unfortunately, in practice, some tests, often called flaky tests, have non-deterministic outcomes. Such tests undermine regression testing because they make it difficult to rely on test results. We present the first extensive study of flaky tests. We study in detail a total of 201 commits that likely fix flaky tests in 51 open-source projects. We classify the most common root causes of flaky tests, identify approaches that could manifest flaky behavior, and describe common strategies that developers use to fix flaky tests. We believe that our insights and implications can help guide future research on the important topic of (avoiding) flaky tests.
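To make the notion of a non-deterministic test concrete, the following minimal Python sketch (an illustration in the spirit of the study's async-wait root cause, not code from the paper) contrasts a flaky pattern, sleeping for a fixed time and hoping asynchronous work has finished, with a robust pattern that waits on an explicit completion signal. All names (`flaky_read`, `robust_read`) are hypothetical.

```python
import threading
import time

# Flaky pattern: start background work, sleep a fixed amount,
# and hope the work has finished. Whether the result is ready
# depends on scheduling, so the test outcome is non-deterministic.
def flaky_read():
    box = {}

    def run():
        time.sleep(0.01)  # simulated asynchronous work
        box["result"] = 42

    threading.Thread(target=run).start()
    time.sleep(0.005)           # may or may not be long enough
    return box.get("result")    # sometimes None, sometimes 42

# Robust pattern: wait on an explicit signal with a generous
# timeout instead of guessing how long the work will take.
def robust_read(timeout=1.0):
    box = {}
    done = threading.Event()

    def run():
        time.sleep(0.01)  # same simulated asynchronous work
        box["result"] = 42
        done.set()        # signal completion explicitly

    threading.Thread(target=run).start()
    assert done.wait(timeout), "worker did not finish in time"
    return box["result"]

print(robust_read())  # deterministically 42
```

A test asserting on `flaky_read()` would pass or fail depending on thread scheduling; a test asserting on `robust_read()` is deterministic because it synchronizes on the event rather than on elapsed wall-clock time.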
An empirical analysis of flaky tests