DOI: 10.1145/2843043.2843059

Research article

What is in our datasets?: describing a structure of datasets

Published: 01 February 2016

ABSTRACT

To facilitate research based on datasets in empirical software engineering, researchers must be able to interpret the meaning of the data correctly. Datasets contain measurements that are associated with metrics and entities. In some datasets it is not clear which entities have been measured or exactly which metrics have been used, which means that measurements could be misinterpreted. The goal of this study is to determine a useful way to understand what datasets are actually intended to represent. We construct precise definitions of datasets and their potential elements, and we develop a metamodel that describes the structure of and concepts in a dataset, together with the relationships between those concepts. We apply the metamodel to existing datasets from the PROMISE repository and find that, of the 70 datasets studied, 61 contain insufficient information to ensure correct interpretation of their metrics and entities. Our metamodel can be used to identify such datasets and to evaluate new datasets, and it will form the foundation for a framework for evaluating the quality of datasets.
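The abstract characterizes a dataset as a collection of measurements, each tied to an entity (the thing measured) and a metric (how it was measured), with the metamodel capturing these concepts and their relationships. As a purely illustrative aid, and not the paper's actual metamodel (which is defined only in the full text), a minimal sketch of such a structure might look like the following; all class and field names here are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Entity:
    """The thing being measured, e.g. a class, file, module, or project release."""
    name: str
    kind: str  # e.g. "class", "file", "project"

@dataclass
class Metric:
    """A named measurement procedure with an explicit definition and scale."""
    name: str        # e.g. "LOC"
    definition: str  # how the value is obtained; empty if undocumented
    scale: str       # e.g. "ratio", "ordinal"

@dataclass
class Measurement:
    """A single value: one metric applied to one entity."""
    entity: Entity
    metric: Metric
    value: float

@dataclass
class Dataset:
    name: str
    measurements: list[Measurement] = field(default_factory=list)

    def undocumented_metrics(self) -> set[str]:
        """Metric names with no definition, i.e. values that risk misinterpretation."""
        return {m.metric.name for m in self.measurements if not m.metric.definition}

# Usage sketch: a metric with no recorded definition is flagged as a risk.
loc = Metric(name="LOC", definition="", scale="ratio")
ds = Dataset("example", [Measurement(Entity("Foo.java", "file"), loc, 120.0)])
print(ds.undocumented_metrics())  # {'LOC'}
```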


• Published in

  ACSW '16: Proceedings of the Australasian Computer Science Week Multiconference
  February 2016, 654 pages
  ISBN: 9781450340427
  DOI: 10.1145/2843043

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Qualifiers

      • research-article

      Acceptance Rates

ACSW '16 paper acceptance rate: 77 of 172 submissions (45%). Overall acceptance rate: 204 of 424 submissions (48%).
