DOI: 10.1145/3319008.3319021

Is the Stack Distance Between Test Case and Method Correlated With Test Effectiveness?

Published: 15 April 2019

ABSTRACT

Mutation testing is a means to assess the effectiveness of a test suite, and its outcome is considered more meaningful than code coverage metrics. However, despite several optimizations, mutation testing requires significant computational effort and has not been widely adopted in industry. In this paper, we therefore study whether test effectiveness can be approximated using a more lightweight approach. We hypothesize that a test case is more likely to detect faults in methods that are close to the test case on the call stack than in methods that the test case reaches only indirectly through many other methods. Based on this hypothesis, we propose the minimal stack distance between test case and method as a new test measure, which expresses how close any test case comes to a given method, and study its correlation with test effectiveness. We conducted an empirical study with 21 open-source projects, comprising 1.8 million LOC in total, and show that a correlation exists between stack distance and test effectiveness. The correlation reaches a strength of up to 0.58. We further show that a classifier using the minimal stack distance along with additional easily computable measures can predict the mutation testing result of a method with 92.9% precision and 93.4% recall. Hence, such a classifier can be considered a lightweight alternative to mutation testing or a preceding, less costly step before it.
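The minimal stack distance of a method can be pictured as the smallest number of call-stack frames separating it from any executing test case, where a distance of 1 means the test case invokes the method directly. As a rough illustration of how such a measure could be recorded at runtime, the Java sketch below tracks a per-thread call depth and keeps the minimum observed depth per method. It is a simplified sketch rather than the authors' tooling, and it assumes hypothetical enter()/exit() hooks that an instrumentation agent (not shown) would invoke around every method call during a test run.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public final class StackDistanceRecorder {

        // Current call depth below the executing test case, per thread.
        private static final ThreadLocal<Integer> DEPTH = ThreadLocal.withInitial(() -> 0);

        // Smallest observed distance between any test case and each method.
        private static final Map<String, Integer> MINIMAL_DISTANCE = new ConcurrentHashMap<>();

        private StackDistanceRecorder() { }

        // Assumed to be called by instrumentation when a method is entered during a test.
        public static void enter(String methodId) {
            int d = DEPTH.get() + 1;   // distance 1 = invoked directly by the test case
            DEPTH.set(d);
            MINIMAL_DISTANCE.merge(methodId, d, Math::min);
        }

        // Assumed to be called by instrumentation when the method returns or throws.
        public static void exit() {
            DEPTH.set(DEPTH.get() - 1);
        }

        // Minimal stack distance of a method over all executed test cases (null if never reached).
        public static Integer minimalStackDistanceOf(String methodId) {
            return MINIMAL_DISTANCE.get(methodId);
        }
    }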



    Reviews

    Vladik Kreinovich

    In general, it is not algorithmically possible to always prove a program's correctness or incorrectness. So, in practice, a program's correctness is usually judged by testing the program on test cases from some test suite. How can we gauge the effectiveness of a test suite? Practitioners use metrics like code coverage, that is, the proportion of the code that is executed by at least one of the test cases. However, this metric is not well correlated with test suite effectiveness: some test suites have high code coverage but fail to detect faults. A more effective technique is mutation testing, where we inject faults into different parts of the code and see whether the tests detect these faults. However, this technique is very time-consuming and thus rarely used in practice. It is therefore desirable to come up with a measure of test suite effectiveness that is more accurate than code coverage and more practical than mutation testing.

    The authors take into account that, for each test, some methods are invoked directly while other methods are invoked indirectly via a chain of method calls. They show, crudely speaking, that the length of this chain, which they call stack distance, is highly correlated with test suite effectiveness in detecting the method's faults. To be more precise, for this correlation to appear, we need to slightly modify the definition of the length: for example, we should not count calls to external libraries (since these libraries are supposed to be already tested), we should not count calls to constructor methods (such methods rarely contain faults), and so on. An even better prediction of possibly faulty methods is obtained if we also take into account different variants of code coverage and other easy-to-compute characteristics, and train a classifier to predict a method's faultiness based on all of this information. This leads to a practically useful alternative to mutation testing.

    The paper is intended for specialists and practitioners in software engineering who are familiar with the basics of statistics. I recommend it to both theoreticians and practitioners.
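    As a concrete illustration of the filtering described in the review (calls to external libraries and to constructors are not counted when the distance is computed), the Java sketch below shows one possible frame-counting predicate. The package prefix and the class and method names are hypothetical and only serve to make the rule explicit; they are not taken from the paper's tooling.

        import java.util.List;

        final class FrameFilter {

            // Hypothetical prefixes treated as "own code"; frames from other packages are
            // regarded as external-library code and are not counted toward the distance.
            private static final List<String> OWN_CODE_PREFIXES = List.of("org.example.");

            private FrameFilter() { }

            // A frame counts toward the stack distance only if it belongs to own code
            // and is not a constructor.
            static boolean countsTowardDistance(StackTraceElement frame) {
                boolean ownCode = OWN_CODE_PREFIXES.stream()
                        .anyMatch(prefix -> frame.getClassName().startsWith(prefix));
                boolean constructor = "<init>".equals(frame.getMethodName());
                return ownCode && !constructor;
            }

            // Distance = number of counted frames between the test case and the method under test.
            static long distance(List<StackTraceElement> framesBetweenTestAndMethod) {
                return framesBetweenTestAndMethod.stream()
                        .filter(FrameFilter::countsTowardDistance)
                        .count();
            }
        }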

    • Published in

      EASE '19: Proceedings of the 23rd International Conference on Evaluation and Assessment in Software Engineering
      April 2019
      345 pages
      ISBN: 9781450371452
      DOI: 10.1145/3319008

      Copyright © 2019 ACM


      Publisher

      Association for Computing Machinery

      New York, NY, United States



      Qualifiers

      • Research article
      • Refereed limited

      Acceptance Rates

      EASE '19 paper acceptance rate: 20 of 73 submissions (27%). Overall acceptance rate: 71 of 232 submissions (31%).
