What is in our datasets? Describing a structure of datasets

ABSTRACT
To support research based on datasets in empirical software engineering, the meaning of the data must be interpretable. Datasets contain measurements associated with metrics and entities. In some datasets it is not clear which entities were measured or exactly which metrics were used, so measurements can be misinterpreted. The goal of this study is to determine a useful way to understand what datasets are actually intended to represent. We construct precise definitions of datasets and their potential elements, and we develop a metamodel that describes the structure of and concepts in a dataset, along with the relationships between those concepts. We apply the metamodel to existing datasets from the PROMISE repository. Of the 70 datasets we studied, 61 contained insufficient information about their metrics and entities to ensure correct interpretation. Our metamodel can be used to identify such datasets and to evaluate new ones, and it will form the foundation for a framework to evaluate the quality of datasets.
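As an illustrative sketch only (the paper itself does not define code), the core concepts the abstract describes — a dataset as a collection of measurements, each tying one entity to one metric — could be modeled as follows. All class and field names here are hypothetical choices, not the paper's terminology:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    name: str   # what was measured, e.g. a class or a file
    kind: str   # e.g. "class", "file", "project"

@dataclass(frozen=True)
class Metric:
    name: str        # e.g. "LOC"
    definition: str  # the precise definition needed for correct interpretation

@dataclass
class Measurement:
    entity: Entity
    metric: Metric
    value: float

@dataclass
class Dataset:
    name: str
    measurements: list[Measurement] = field(default_factory=list)

    def underspecified(self) -> list[Measurement]:
        """Measurements whose metric has no recorded definition and so
        risk being misinterpreted -- the problem the study highlights."""
        return [m for m in self.measurements if not m.metric.definition]
```

Under this sketch, checking a dataset for the interpretation problem the abstract describes amounts to asking which of its measurements reference a metric without a definition.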