ABSTRACT
Regression testing is a crucial part of software development. It checks that software changes do not break existing functionality. An important assumption of regression testing is that test outcomes are deterministic: an unmodified test is expected to either always pass or always fail for the same code under test. Unfortunately, in practice, some tests, often called flaky tests, have non-deterministic outcomes. Such tests undermine regression testing because they make it difficult to rely on test results. We present the first extensive study of flaky tests. We study in detail a total of 201 commits that likely fix flaky tests in 51 open-source projects. We classify the most common root causes of flaky tests, identify approaches that could manifest flaky behavior, and describe common strategies that developers use to fix flaky tests. We believe that our insights and implications can help guide future research on the important topic of (avoiding) flaky tests.
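To make the notion of a non-deterministic test concrete, the following minimal Python sketch (an illustration in the spirit of the study's async-wait root cause, not code from the paper) contrasts a flaky pattern, sleeping for a fixed time and hoping asynchronous work has finished, with a robust pattern that waits on an explicit completion signal. All names (`flaky_read`, `robust_read`) are hypothetical.

```python
import threading
import time

# Flaky pattern: start background work, sleep a fixed amount,
# and hope the work has finished. Whether the result is ready
# depends on scheduling, so the test outcome is non-deterministic.
def flaky_read():
    box = {}

    def run():
        time.sleep(0.01)  # simulated asynchronous work
        box["result"] = 42

    threading.Thread(target=run).start()
    time.sleep(0.005)           # may or may not be long enough
    return box.get("result")    # sometimes None, sometimes 42

# Robust pattern: wait on an explicit signal with a generous
# timeout instead of guessing how long the work will take.
def robust_read(timeout=1.0):
    box = {}
    done = threading.Event()

    def run():
        time.sleep(0.01)  # same simulated asynchronous work
        box["result"] = 42
        done.set()        # signal completion explicitly

    threading.Thread(target=run).start()
    assert done.wait(timeout), "worker did not finish in time"
    return box["result"]

print(robust_read())  # deterministically 42
```

A test asserting on `flaky_read()` would pass or fail depending on thread scheduling; a test asserting on `robust_read()` is deterministic because it synchronizes on the event rather than on elapsed wall-clock time.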
An empirical analysis of flaky tests