DOI: 10.1145/1882291.1882315
research-article

A study of the uniqueness of source code

Published: 07 November 2010

Abstract

This paper presents the results of the first study of the uniqueness of source code. We define the uniqueness of a unit of source code with respect to the entire body of written software, which we approximate with a corpus of 420 million lines of source code. Our high-level methodology consists of examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that we call syntactic redundancy. We parameterized our study over a variety of variables, the most important of which is the level of granularity at which we view source code. Our suite of experiments together consumed approximately four months of CPU time, providing quantitative answers to the following questions: at what levels of granularity is software unique, and at a given level of granularity, how unique is software? While we believe these questions to be of intrinsic interest, we also discuss possible applications to genetic programming and developer productivity tools.
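
To make the idea of 'assembling' a project from a corpus concrete, here is a minimal sketch of one way such a redundancy measure could be computed. It is an illustration under stated assumptions, not the authors' implementation: it assumes code is already tokenized, treats granularity g as a window length in tokens, counts only exact matches, and counts a token as redundant when it lies inside at least one matching window. The names windows, build_index, and syntactic_redundancy are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Token = str
Window = Tuple[Token, ...]


def windows(tokens: List[Token], g: int) -> Iterable[Window]:
    """Yield every contiguous run of g tokens (none if the list is shorter than g)."""
    for i in range(len(tokens) - g + 1):
        yield tuple(tokens[i:i + g])


def build_index(corpus: Iterable[List[Token]], g: int) -> Set[Window]:
    """Collect every length-g token window that occurs anywhere in the corpus."""
    index: Set[Window] = set()
    for project in corpus:
        index.update(windows(project, g))
    return index


def syntactic_redundancy(project: List[Token], index: Set[Window], g: int) -> float:
    """Fraction of the project's tokens covered by a window found in the corpus.

    Assumption (not taken from the source page): a token is 'redundant' if it
    falls inside at least one length-g window that also occurs in the corpus.
    """
    if not project:
        return 0.0
    covered = [False] * len(project)
    for i, window in enumerate(windows(project, g)):
        if window in index:
            for j in range(i, i + g):
                covered[j] = True
    return sum(covered) / len(project)


# Toy data: a two-project "corpus" and one project to measure.
corpus = [["a", "b", "c", "d"], ["x", "y", "z"]]
project = ["a", "b", "c", "q", "y", "z"]

redundancy_by_g = {
    g: syntactic_redundancy(project, build_index(corpus, g), g) for g in (2, 3, 4)
}
```

On this toy input the measured redundancy falls as g grows, which is the kind of granularity dependence the abstract's two questions are about. A real study at the scale described would need hashing or fingerprinting instead of an in-memory set of tuples.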


Reviews

Nancy S. Eickelmann

Gabel and Su propose syntactic redundancy as a new measure of the degree of software uniqueness. Syntactic redundancy is the percentage of matches between a user-selected code sequence of specific token composition and two sets of selected open-source projects. The efficacy of the methodology and validity of the results are not well documented. The study computes the syntactic redundancy percentage for two samples from 6,000 SourceForge projects: a sample of 30 SourceForge projects and a sample of undisclosed size. The latter sample is obtained using a uniform sampling strategy; the number of tokens used is stated as 1,500, which equates to about 250 lines of code. The authors imply that the study's findings generalize to all 6,000 projects based on the sample that was evaluated, despite their statement that "[they] have not formally formulated/tested any hypotheses, and [they] have not performed any statistical tests." The study has two independent variables: the number of tokens included in a matching sequence, designated as g (the level of granularity), and the matching criterion, which takes one of two values: no abstraction, or renamed identifiers. Matches are measured using Hamming distances for binary variables, with distances of one to four. In all but one experiment, the study found no significant amount of redundancy at levels of g over 60, which equates to approximately ten lines of code. (Online Computing Reviews Service)
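
The review's two numeric points can be sketched briefly. The snippet below is an illustration only: it assumes Hamming distance is taken position by position over two token sequences of equal length, and it derives a tokens-per-line ratio purely from the review's figure of 1,500 tokens for about 250 lines. The helper names hamming, approx_match, and tokens_to_lines are hypothetical.

```python
from typing import Sequence


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length token sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def approx_match(a: Sequence[str], b: Sequence[str], max_distance: int) -> bool:
    """Treat two sequences as matching if they differ in at most max_distance tokens."""
    return hamming(a, b) <= max_distance


# Ratio implied by the review: 1,500 tokens is about 250 lines of code.
TOKENS_PER_LINE = 1500 / 250


def tokens_to_lines(g: int) -> float:
    """Convert a granularity in tokens to an approximate number of lines."""
    return g / TOKENS_PER_LINE


left = ["int", "i", "=", "0", ";"]
right = ["int", "j", "=", "1", ";"]
distance = hamming(left, right)
lines_at_g60 = tokens_to_lines(60)
```

Here the two statements differ in two token positions, so they match under a tolerance of two or more but not under a tolerance of one, and g = 60 comes out at about ten lines, consistent with the review's estimate.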

Published In

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering
November 2010
302 pages
ISBN: 9781605587912
DOI: 10.1145/1882291

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. large scale study
  2. software uniqueness
  3. source code

Qualifiers

  • Research-article

Conference

SIGSOFT/FSE'10

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%

Article Metrics

  • Downloads (Last 12 months): 67
  • Downloads (Last 6 weeks): 17
Reflects downloads up to 11 Jan 2025

Cited By

  • (2024) Leveraging Phylogenetics in Software Product Families: The Case of Latent Content Generation in Video Games. Proceedings of the 28th ACM International Systems and Software Product Line Conference, pages 113-124. DOI: 10.1145/3646548.3672596. Online publication date: 2-Sep-2024
  • (2024) An Empirical Study of Code Clones: Density, Entropy, and Patterns. Science of Computer Programming, article 103259. DOI: 10.1016/j.scico.2024.103259. Online publication date: Dec-2024
  • (2023) Natural Language to Code: How Far Are We? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 375-387. DOI: 10.1145/3611643.3616323. Online publication date: 30-Nov-2023
  • (2023) Big Code Search: A Bibliography. ACM Computing Surveys, 56:1, pages 1-49. DOI: 10.1145/3604905. Online publication date: 26-Aug-2023
  • (2023) DAISY: Dynamic-Analysis-Induced Source Discovery for Sensitive Data. ACM Transactions on Software Engineering and Methodology, 32:4, pages 1-34. DOI: 10.1145/3569936. Online publication date: 27-May-2023
  • (2023) Using Transfer Learning for Code-Related Tasks. IEEE Transactions on Software Engineering, 49:4, pages 1580-1598. DOI: 10.1109/TSE.2022.3183297. Online publication date: 1-Apr-2023
  • (2023) A Statistical Method for API Usage Learning and API Misuse Violation Finding. 2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA), pages 358-365. DOI: 10.1109/SERA57763.2023.10197708. Online publication date: 23-May-2023
  • (2023) Large Language Models for Software Engineering: Survey and Open Problems. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE), pages 31-53. DOI: 10.1109/ICSE-FoSE59343.2023.00008. Online publication date: 14-May-2023
  • (2023) Using the Deep Learning-Based Approaches for Program Debugging and Repair. 2023 10th International Conference on Dependable Systems and Their Applications (DSA), pages 443-454. DOI: 10.1109/DSA59317.2023.00059. Online publication date: 10-Aug-2023
  • (2023) Memorization and generalization in neural code intelligence models. Information and Software Technology, 153:C. DOI: 10.1016/j.infsof.2022.107066. Online publication date: 1-Jan-2023
