What is in our datasets? Describing a structure of datasets

ABSTRACT
To support research based on datasets in empirical software engineering, the meaning of the data must be interpretable. Datasets contain measurements associated with metrics and entities. In some datasets it is not clear which entities were measured or exactly which metrics were used, so measurements can be misinterpreted. The goal of this study is to determine a useful way to understand what datasets are actually intended to represent. We construct precise definitions of datasets and their potential elements, and we develop a metamodel that describes the structure of and concepts in a dataset, along with the relationships between those concepts. We apply the metamodel to existing datasets from the PROMISE repository. Of the 70 datasets we studied, 61 contained insufficient information about their metrics and entities to ensure correct interpretation. Our metamodel can be used to identify such datasets and to evaluate new ones, and it will form the foundation for a framework to evaluate the quality of datasets.
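As an illustrative sketch only (the paper itself does not define code), the core concepts the abstract describes — a dataset as a collection of measurements, each tying one entity to one metric — could be modeled as follows. All class and field names here are hypothetical choices, not the paper's terminology:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:
    name: str   # what was measured, e.g. a class or a file
    kind: str   # e.g. "class", "file", "project"

@dataclass(frozen=True)
class Metric:
    name: str        # e.g. "LOC"
    definition: str  # the precise definition needed for correct interpretation

@dataclass
class Measurement:
    entity: Entity
    metric: Metric
    value: float

@dataclass
class Dataset:
    name: str
    measurements: list[Measurement] = field(default_factory=list)

    def underspecified(self) -> list[Measurement]:
        """Measurements whose metric has no recorded definition and so
        risk being misinterpreted -- the problem the study highlights."""
        return [m for m in self.measurements if not m.metric.definition]
```

Under this sketch, checking a dataset for the interpretation problem the abstract describes amounts to asking which of its measurements reference a metric without a definition.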