DOI: 10.1145/3319008.3319021

Is the Stack Distance Between Test Case and Method Correlated With Test Effectiveness?

Published: 15 April 2019

ABSTRACT

Mutation testing is a means to assess the effectiveness of a test suite, and its outcome is considered more meaningful than code coverage metrics. However, despite several optimizations, mutation testing requires significant computational effort and has not been widely adopted in industry. In this paper, we therefore study whether test effectiveness can be approximated using a more lightweight approach. We hypothesize that a test case is more likely to detect faults in methods that are close to the test case on the call stack than in methods that the test case reaches only indirectly through many other methods. Based on this hypothesis, we propose the minimal stack distance between test case and method as a new test measure, which expresses how close any test case comes to a given method, and study its correlation with test effectiveness. We conducted an empirical study with 21 open-source projects, comprising 1.8 million LOC in total, and show that a correlation exists between stack distance and test effectiveness. The correlation reaches a strength of up to 0.58. We further show that a classifier using the minimal stack distance along with additional easily computable measures can predict the mutation testing result of a method with 92.9% precision and 93.4% recall. Hence, such a classifier can be considered a lightweight alternative to mutation testing or a preceding, less costly step before it.
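The minimal stack distance of a method can be pictured as the smallest number of call-stack frames separating it from any executing test case, where a distance of 1 means the test case invokes the method directly. As a rough illustration of how such a measure could be recorded at runtime, the Java sketch below tracks a per-thread call depth and keeps the minimum observed depth per method. It is a simplified sketch rather than the authors' tooling, and it assumes hypothetical enter()/exit() hooks that an instrumentation agent (not shown) would invoke around every method call during a test run.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class StackDistanceRecorder {

        // Current call depth below the executing test case, per thread.
        private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

        // Smallest observed distance between any test case and each method.
        private static final Map<String, Integer> MINIMAL_DISTANCE = new ConcurrentHashMap<>();

        private StackDistanceRecorder() { }

        // Assumed to be called by instrumentation when a method is entered during a test.
        public static void enter(String methodId) {
            int d = DEPTH.get() + 1;   // distance 1 = invoked directly by the test case
            DEPTH.set(d);
            MINIMAL_DISTANCE.merge(methodId, d, Math::min);
        }

        // Assumed to be called by instrumentation when the method returns or throws.
        public static void exit() {
            DEPTH.set(DEPTH.get() - 1);
        }

        // Minimal stack distance of a method over all executed test cases (null if never reached).
        public static Integer minimalStackDistanceOf(String methodId) {
            return MINIMAL_DISTANCE.get(methodId);
        }
    }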



    Reviews

    Vladik Kreinovich

    In general, it is not algorithmically possible to always prove a program's correctness or incorrectness. So, in practice, a program's correctness is usually judged by testing the program on test cases from some test suite. How can we gauge the effectiveness of a test suite? Practitioners use metrics like code coverage, that is, the proportion of the code that is executed by at least one of the test cases. However, this metric is not well correlated with test suite effectiveness: some test suites have high code coverage but fail to detect faults. A more effective technique is mutation testing, where we inject faults into different parts of the code and see whether the tests detect these faults. However, this technique is very time-consuming and thus rarely used in practice. It is therefore desirable to come up with a measure of test suite effectiveness that is more accurate than code coverage and more practical than mutation testing.

    The authors take into account that, for each test, some methods are invoked directly while other methods are invoked indirectly via a chain of method calls. They show, crudely speaking, that the length of this chain, which they call stack distance, is highly correlated with test suite effectiveness in detecting the method's faults. To be more precise, for this correlation to appear, we need to slightly modify the definition of the length: for example, we should not count calls to external libraries (since these libraries are supposed to be already tested), we should not count calls to constructor methods (such methods rarely contain faults), and so on. An even better prediction of possibly faulty methods is obtained if we also take into account different variants of code coverage and other easy-to-compute characteristics, and train a classifier to predict a method's faultiness based on all of this information. This leads to a practically useful alternative to mutation testing.

    The paper is intended for specialists and practitioners in software engineering who are familiar with the basics of statistics. I recommend it to both theoreticians and practitioners.
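    As a concrete illustration of the filtering described in the review (calls to external libraries and to constructors are not counted when the distance is computed), the Java sketch below shows one possible frame-counting predicate. The package prefix and the class and method names are hypothetical and only serve to make the rule explicit; they are not taken from the paper's tooling.

        import java.util.List;

        final class FrameFilter {

            // Hypothetical prefixes treated as "own code"; frames from other packages are
            // regarded as external-library code and are not counted toward the distance.
            private static final List<String> OWN_CODE_PREFIXES = List.of("org.example.");

            private FrameFilter() { }

            // A frame counts toward the stack distance only if it belongs to own code
            // and is not a constructor.
            static boolean countsTowardDistance(StackTraceElement frame) {
                boolean ownCode = OWN_CODE_PREFIXES.stream()
                        .anyMatch(prefix -> frame.getClassName().startsWith(prefix));
                boolean constructor = "<init>".equals(frame.getMethodName());
                return ownCode && !constructor;
            }

            // Distance = number of counted frames between the test case and the method under test.
            static long distance(List<StackTraceElement> framesBetweenTestAndMethod) {
                return framesBetweenTestAndMethod.stream()
                        .filter(FrameFilter::countsTowardDistance)
                        .count();
            }
        }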

    • Published in

      EASE '19: Proceedings of the 23rd International Conference on Evaluation and Assessment in Software Engineering
      April 2019
      345 pages
      ISBN: 9781450371452
      DOI: 10.1145/3319008

      Copyright © 2019 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • Research article
      • Refereed limited

      Acceptance Rates

      EASE '19 paper acceptance rate: 20 of 73 submissions (27%). Overall acceptance rate: 71 of 232 submissions (31%).
