DOI: 10.1145/2901739.2901753
research-article

Addressing problems with external validity of repository mining studies through a smart data platform

Published: 14 May 2016

Abstract

Research in software repository mining has grown considerably over the last decade. Due to the data-driven nature of this line of investigation, we identified several problems within the current state of the art that pose a threat to the external validity of results. The heavy reuse of data sets across many studies may invalidate their results if problems with the data itself are identified. Moreover, for many studies the data and/or the implementations are not available, which hinders replication of the results and thereby decreases the comparability between studies. Even if all information about a study is available, the diversity of the tooling used can still make replication very hard. In this paper, we discuss a potential solution to these problems: a cloud-based platform that integrates data collection and analytics. We created the prototype SmartSHARK, which implements our approach. Using SmartSHARK, we collected data from several projects and created different analytic examples. We present SmartSHARK and discuss our experiences regarding its use and the problems mentioned above.

Published In

MSR '16: Proceedings of the 13th International Conference on Mining Software Repositories
May 2016
544 pages
ISBN:9781450341868
DOI:10.1145/2901739
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. smart data
  2. software analytics
  3. software mining

Qualifiers

  • Research-article

Conference

ICSE '16
Cited By

  • (2024) Prevalence and Prediction of Unseen Co-Changes: A Graph-Based Approach. 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC), pp. 1157-1167. DOI: 10.1109/COMPSAC61105.2024.00155. Online publication date: 2-Jul-2024
  • (2021) Application of Machine Intelligence-Based Knowledge Graphs for Software Engineering. Methodologies and Applications of Computational Statistics for Machine Intelligence, pp. 186-202. DOI: 10.4018/978-1-7998-7701-1.ch010. Online publication date: 2021
  • (2020) Polyglot and Distributed Software Repository Mining with Crossflow. Proceedings of the 17th International Conference on Mining Software Repositories, pp. 374-384. DOI: 10.1145/3379597.3387481. Online publication date: 29-Jun-2020
  • (2020) Designing an Effective User Interface for Analyzing Software Repositories. 2020 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC), pp. 1-2. DOI: 10.1109/VL/HCC50065.2020.9127206. Online publication date: Aug-2020
  • (2019) Anticipatory development processes for reducing total ownership costs and schedules. Systems Engineering, 22:5, pp. 401-410. DOI: 10.1002/sys.21490. Online publication date: 7-May-2019
  • (2018) A scalable and efficient approach for compiling and analyzing commit history. Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pp. 1-10. DOI: 10.1145/3239235.3239237. Online publication date: 11-Oct-2018
  • (2018) Restmule. Proceedings of the 15th International Conference on Mining Software Repositories, pp. 537-541. DOI: 10.1145/3196398.3196405. Online publication date: 28-May-2018
  • (2018) Addressing problems with replicability and validity of repository mining studies through a smart data platform. Empirical Software Engineering, 23:2, pp. 1036-1083. DOI: 10.1007/s10664-017-9537-x. Online publication date: 1-Apr-2018
  • (2018) Simulating Software Refactorings Based on Graph Transformations. Simulation Science, pp. 161-175. DOI: 10.1007/978-3-319-96271-9_10. Online publication date: 8-Aug-2018
  • (2018) Big Data, the Next Step in the Evolution of Educational Data Analysis. Proceedings of the International Conference on Information Technology & Systems (ICITS 2018), pp. 138-147. DOI: 10.1007/978-3-319-73450-7_14. Online publication date: 5-Jan-2018
