A study of the uniqueness of source code

Authors:

Mark Gabel,

Zhendong Su

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering

Pages 147 - 156

https://doi.org/10.1145/1882291.1882315

Published: 07 November 2010

Abstract

This paper presents the results of the first study of the uniqueness of source code. We define the uniqueness of a unit of source code with respect to the entire body of written software, which we approximate with a corpus of 420 million lines of source code. Our high-level methodology consists of examining a collection of 6,000 software projects and measuring the degree to which each project can be 'assembled' solely from portions of this corpus, thus providing a precise measure of 'uniqueness' that we call syntactic redundancy. We parameterized our study over a variety of variables, the most important of which is the level of granularity at which we view source code. Our suite of experiments together consumed approximately four months of CPU time, providing quantitative answers to the following questions: at what levels of granularity is software unique, and at a given level of granularity, how unique is software? While we believe these questions to be of intrinsic interest, we discuss possible applications to genetic programming and developer productivity tools.
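
The idea of 'assembling' a project from portions of a corpus can be illustrated with a small sketch. It is not the authors' implementation: it assumes that a project is already available as a flat list of tokens, that a match means an exact, contiguous run of g tokens that also occurs somewhere in the corpus, and that redundancy is the fraction of a project's tokens covered by at least one such run. The function names (`windows`, `build_index`, `syntactic_redundancy`) are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple

Window = Tuple[str, ...]


def windows(tokens: Sequence[str], g: int) -> Iterable[Window]:
    """Yield every contiguous token sequence of length g."""
    for i in range(len(tokens) - g + 1):
        yield tuple(tokens[i:i + g])


def build_index(corpus: Iterable[Sequence[str]], g: int) -> Set[Window]:
    """Collect all length-g token sequences found in the corpus.

    Assumption: the corpus does not contain the project being measured.
    """
    index: Set[Window] = set()
    for project in corpus:
        index.update(windows(project, g))
    return index


def syntactic_redundancy(project: Sequence[str], index: Set[Window], g: int) -> float:
    """Fraction of the project's tokens covered by some matched window."""
    if not project:
        return 0.0
    covered = [False] * len(project)
    for i, window in enumerate(windows(project, g)):
        if window in index:
            # Every token inside a matched window counts as redundant.
            for j in range(i, i + g):
                covered[j] = True
    return sum(covered) / len(project)


# Toy data (hypothetical): one corpus project and one project under study.
corpus = [["a", "=", "b", "+", "c", ";"]]
project = ["x", "=", "b", "+", "c", ";", "y"]
index = build_index(corpus, g=3)
redundancy = syntactic_redundancy(project, index, g=3)
```

In this toy case five of the seven tokens fall inside a matched window. Raising g makes matches rarer, which is the sense in which granularity is the study's central parameter.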

Reviews

Reviewer: Nancy S. Eickelmann

Gabel and Su propose syntactic redundancy as a new measure of the degree of software uniqueness. Syntactic redundancy is the percentage of matches between a user-selected code sequence of specific token composition and two sets of selected open-source projects. The efficacy of the methodology and validity of the results are not well documented. The study computes the syntactic redundancy percentage for two samples from 6,000 SourceForge projects: a sample of 30 SourceForge projects and a sample of undisclosed size. The latter sample is obtained using a uniform sampling strategy; the number of tokens used is stated as 1,500, which equates to about 250 lines of code. The authors imply that the study's findings extend, by virtue of their generalizability, to all 6,000 projects based on the sample that was evaluated, despite their statement that "[they] have not formally formulated/tested any hypotheses, and [they] have not performed any statistical tests."

The study has two independent variables: the number of tokens included in a matching sequence, designated as g (the level of granularity), and the matching criterion, which takes one of two values: no abstraction or renamed identifiers. Matches are recorded using Hamming distances for binary variables, at distances of one to four. In all but one experiment, the study found no significant amount of redundancy at levels of g over 60. This equates to approximately ten lines of code.

Online Computing Reviews Service
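
The review's arithmetic and its mention of Hamming distance can be made concrete with a short sketch. It makes two assumptions: that an approximate match is a pair of equal-length token sequences differing in at most a given number of positions, and that the tokens-per-line ratio is the one implied by the review's figures (1,500 tokens for about 250 lines). The names `hamming`, `matches_within`, and `approx_lines` are hypothetical and do not come from the paper.

```python
from typing import Iterable, Sequence

# Ratio implied by the review: 1,500 tokens is about 250 lines of code.
TOKENS_PER_LINE = 1500 / 250


def approx_lines(tokens: int) -> float:
    """Convert a token count to an approximate line count."""
    return tokens / TOKENS_PER_LINE


def hamming(a: Sequence[str], b: Sequence[str]) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def matches_within(seq: Sequence[str],
                   candidates: Iterable[Sequence[str]],
                   max_distance: int) -> bool:
    """True if some candidate is within max_distance mismatches of seq."""
    return any(hamming(seq, c) <= max_distance for c in candidates)


# Toy data (hypothetical): one query sequence and two corpus sequences.
query = ("i", "=", "i", "+", "1", ";")
candidates = [("j", "=", "j", "+", "1", ";"), ("return", "x", ";", "}", "}", "}")]

exact = matches_within(query, candidates, max_distance=0)
near = matches_within(query, candidates, max_distance=2)
lines_at_g60 = approx_lines(60)
```

Here the query has no exact match but does match the first candidate once two mismatches are allowed. The same ratio gives about ten lines for g = 60, which agrees with the review's closing figure.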

Published In

FSE '10: Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering

November 2010

302 pages

ISBN:9781605587912

DOI:10.1145/1882291

General Chair: Gruia-Catalin Roman, Washington University in St. Louis, USA

Program Chair: André van der Hoek, University of California, Irvine, USA

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGSOFT/FSE'10

Sponsor:

SIGSOFT

SIGSOFT/FSE'10: 18th ACM SIGSOFT Symposium on the Foundations of Software Engineering

November 7 - 11, 2010

Santa Fe, New Mexico, USA

Acceptance Rates

Overall Acceptance Rate 17 of 128 submissions, 13%

Bibliometrics

Total Citations: 154

Total Downloads: 1,012

Downloads (Last 12 months): 67

Downloads (Last 6 weeks): 17

Reflects downloads up to 11 Jan 2025

Cited By

Chueca J, Blasco D, Cetina C, Font J (2024). Leveraging Phylogenetics in Software Product Families: The Case of Latent Content Generation in Video Games. Proceedings of the 28th ACM International Systems and Software Product Line Conference (113-124). Online publication date: 2-Sep-2024.
https://dl.acm.org/doi/10.1145/3646548.3672596

Hu B, Yu D, Wu Y, Hu T, Cai Y (2024). An Empirical Study of Code Clones: Density, Entropy, and Patterns. Science of Computer Programming (103259). Online publication date: Dec-2024.
https://doi.org/10.1016/j.scico.2024.103259

Wang S, Geng M, Lin B, Sun Z, Wen M, Liu Y, Li L, Bissyandé T, Mao X, Chandra S, Blincoe K, Tonella P (2023). Natural Language to Code: How Far Are We? Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (375-387). Online publication date: 30-Nov-2023.
https://dl.acm.org/doi/10.1145/3611643.3616323

Kim K, Ghatpande S, Kim D, Zhou X, Liu K, Bissyandé T, Klein J, Le Traon Y (2023). Big Code Search: A Bibliography. ACM Computing Surveys 56:1 (1-49). Online publication date: 26-Aug-2023.
https://dl.acm.org/doi/10.1145/3604905

Zhang X, Heaps J, Slavin R, Niu J, Breaux T, Wang X (2023). DAISY: Dynamic-Analysis-Induced Source Discovery for Sensitive Data. ACM Transactions on Software Engineering and Methodology 32:4 (1-34). Online publication date: 27-May-2023.
https://dl.acm.org/doi/10.1145/3569936

Mastropaolo A, Cooper N, Palacio D, Scalabrino S, Poshyvanyk D, Oliveto R, Bavota G (2023). Using Transfer Learning for Code-Related Tasks. IEEE Transactions on Software Engineering 49:4 (1580-1598). Online publication date: 1-Apr-2023.
https://doi.org/10.1109/TSE.2022.3183297

Panda D, Basia P, Nallavolu K, Zhong X, Siy H, Song M (2023). A Statistical Method for API Usage Learning and API Misuse Violation Finding. 2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA) (358-365). Online publication date: 23-May-2023.
https://doi.org/10.1109/SERA57763.2023.10197708

Fan A, Gokkaya B, Harman M, Lyubarskiy M, Sengupta S, Yoo S, Zhang J (2023). Large Language Models for Software Engineering: Survey and Open Problems. 2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE) (31-53). Online publication date: 14-May-2023.
https://doi.org/10.1109/ICSE-FoSE59343.2023.00008

Lin T, Huang C, Fang C (2023). Using the Deep Learning-Based Approaches for Program Debugging and Repair. 2023 10th International Conference on Dependable Systems and Their Applications (DSA) (443-454). Online publication date: 10-Aug-2023.
https://doi.org/10.1109/DSA59317.2023.00059

Rabin M, Hussain A, Alipour M, Hellendoorn V (2023). Memorization and generalization in neural code intelligence models. Information and Software Technology 153:C. Online publication date: 1-Jan-2023.
https://dl.acm.org/doi/10.1016/j.infsof.2022.107066
