DOI: 10.1145/1368088.1368132
research-article

Scalable detection of semantic clones

Published: 10 May 2008

Abstract

Several techniques have been developed for identifying similar code fragments in programs. These similar fragments, referred to as code clones, can be used to identify redundant code, locate bugs, or gain insight into program design. Existing scalable approaches to clone detection are limited to finding program fragments that are similar only in their contiguous syntax. Other, semantics-based approaches are more resilient to differences in syntax, such as reordered statements, related statements interleaved with other unrelated statements, or the use of semantically equivalent control structures. However, none of these semantics-based techniques has scaled to real-world code bases. These approaches capture semantic information from Program Dependence Graphs (PDGs), program representations that encode data and control dependencies between statements and predicates. Our definition of a code clone is also based on this representation: we consider program fragments with isomorphic PDGs to be clones.
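To make the isomorphism-based definition concrete, the following is a minimal sketch, not the paper's implementation. It assumes straight-line code only, models data dependencies but no control dependencies, and uses a brute-force isomorphism check that is only practical for tiny fragments. The statement encoding and the names `data_dependence_edges` and `isomorphic` are hypothetical.

```python
from itertools import permutations

def data_dependence_edges(stmts):
    """stmts is a list of (defined_var, used_vars, label) tuples.
    Returns edges (i, j): statement j uses a value defined by statement i."""
    last_def = {}
    edges = set()
    for j, (target, uses, _label) in enumerate(stmts):
        for var in uses:
            if var in last_def:
                edges.add((last_def[var], j))
        last_def[target] = j
    return edges

def isomorphic(stmts_a, stmts_b):
    """Brute-force check for a label-preserving graph isomorphism."""
    if len(stmts_a) != len(stmts_b):
        return False
    edges_a = data_dependence_edges(stmts_a)
    edges_b = data_dependence_edges(stmts_b)
    n = len(stmts_a)
    for perm in permutations(range(n)):
        # perm[i] is the statement in B that statement i of A maps to
        if any(stmts_a[i][2] != stmts_b[perm[i]][2] for i in range(n)):
            continue
        if {(perm[i], perm[j]) for i, j in edges_a} == edges_b:
            return True
    return False

# a = read(); b = read(); c = a * 2; d = b + 1; e = c + d
frag_a = [("a", [], "read"), ("b", [], "read"),
          ("c", ["a"], "mul"), ("d", ["b"], "add"),
          ("e", ["c", "d"], "sum")]

# Same computation with renamed variables and reordered statements
frag_b = [("y", [], "read"), ("q", ["y"], "add"),
          ("x", [], "read"), ("p", ["x"], "mul"),
          ("r", ["p", "q"], "sum")]

# Same statement kinds, but the final sum ignores q
frag_c = [("y", [], "read"), ("q", ["y"], "add"),
          ("x", [], "read"), ("p", ["x"], "mul"),
          ("r", ["p"], "sum")]

same_ab = isomorphic(frag_a, frag_b)
same_ac = isomorphic(frag_a, frag_c)
```

Here `frag_a` and `frag_b` differ in statement order and variable names, so a purely syntactic comparison of contiguous code would miss them, yet their dependence graphs match. `frag_c` uses the same statement kinds but has a different dependence structure, so it is not a clone under this definition.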
In this paper, we present the first scalable clone detection algorithm based on this definition of semantic clones. Our key insight is to reduce the difficult graph similarity problem to a simpler tree similarity problem by mapping carefully selected PDG subgraphs to their related structured syntax. We efficiently solve the tree similarity problem to create a scalable analysis. We have implemented this algorithm in a practical tool and performed evaluations on several million-line open-source projects, including the Linux kernel. Compared with previous approaches, our tool locates significantly more clones, which are often more semantically interesting than simple copied and pasted code fragments.
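The abstract does not say how the tree similarity problem is solved. As an illustration only, one common way to approximate tree similarity is to summarize each syntax tree as a vector of node-type counts and compare the vectors by distance. The sketch below assumes that approach and uses Python's standard `ast` module on Python snippets; the paper's tool targets other code bases, and the function names are hypothetical.

```python
import ast
from collections import Counter
from math import sqrt

def characteristic_vector(source):
    """Count how often each AST node type occurs in the snippet."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def distance(u, v):
    """Euclidean distance between two node-type count vectors."""
    keys = set(u) | set(v)
    return sqrt(sum((u[k] - v[k]) ** 2 for k in keys))

snippet_a = "total = 0\nfor x in items:\n    total = total + x\n"
snippet_b = "acc = 0\nfor v in values:\n    acc = acc + v\n"
snippet_c = "if flag:\n    print(flag)\n"

vec_a = characteristic_vector(snippet_a)
vec_b = characteristic_vector(snippet_b)
vec_c = characteristic_vector(snippet_c)

dist_ab = distance(vec_a, vec_b)  # same tree shape, different identifiers
dist_ac = distance(vec_a, vec_c)  # structurally different snippets
```

Because the vectors ignore identifier names, `snippet_a` and `snippet_b` are at distance zero, while `snippet_c` is farther away. Comparing fixed-size vectors is much cheaper than comparing graphs directly, which is the kind of saving the reduction to trees is meant to enable.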

Published In

ICSE '08: Proceedings of the 30th international conference on Software engineering
May 2008
558 pages
ISBN: 9781605580791
DOI: 10.1145/1368088

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clone detection
  2. program dependence graph
  3. refactoring
  4. software maintenance

Qualifiers

  • Research-article

Conference

ICSE '08

Acceptance Rates

ICSE '08 Paper Acceptance Rate 56 of 370 submissions, 15%;
Overall Acceptance Rate 276 of 1,856 submissions, 15%
