DOI: 10.1145/2934732.2934738

Detecting Source Code Re-Use with Ensemble Models

Published: 14 June 2016

Abstract

Source code re-use has usually been approached from a compiler perspective. By treating source code as a piece of text, we can apply natural language techniques to detect source code re-use. This paper describes the use of ensemble models for the task of source code re-use detection. Ensembles of Information Retrieval (IR) models are constructed using common classifiers. The IR-inspired models are compared with the ensembles for the C and Java programming languages. The ensemble classifiers show promising results for detecting source code re-use.
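The idea of treating code as text and combining several similarity models can be illustrated with a minimal Python sketch. This is not the authors' implementation: the choice of character n-gram cosine similarity and token Jaccard overlap as the IR-style models, the function names, and the thresholds are all assumptions made for illustration, and the majority vote is a simple stand-in for a trained classifier.

```python
import math
import re
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def token_jaccard(x, y):
    """Jaccard overlap between the sets of identifier-like tokens."""
    tx = set(re.findall(r"[A-Za-z_]\w*", x))
    ty = set(re.findall(r"[A-Za-z_]\w*", y))
    return len(tx & ty) / len(tx | ty) if tx | ty else 0.0


def similarity_features(x, y):
    """Scores from three illustrative text-based models for one code pair."""
    return [
        cosine(char_ngrams(x, 3), char_ngrams(y, 3)),
        cosine(char_ngrams(x, 5), char_ngrams(y, 5)),
        token_jaccard(x, y),
    ]


def ensemble_vote(features, thresholds=(0.8, 0.7, 0.6)):
    """Majority vote over thresholded scores.

    The thresholds are hypothetical; a real ensemble would learn the
    decision from labelled re-use pairs instead of hand-set cut-offs.
    """
    votes = sum(score >= t for score, t in zip(features, thresholds))
    return votes * 2 > len(features)
```

In a fuller setup, the feature vector returned by `similarity_features` would presumably be fed to a trained classifier rather than to fixed thresholds.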


Published In

CERI '16: Proceedings of the 4th Spanish Conference on Information Retrieval
June 2016
146 pages
ISBN: 9781450341417
DOI: 10.1145/2934732

In-Cooperation

  • University of Granada

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. ensemble classifiers
  2. re-use detection
  3. source code re-use
  4. source code re-use retrieval

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

CERI '16

Acceptance Rates

CERI '16 Paper Acceptance Rate 18 of 27 submissions, 67%;
Overall Acceptance Rate 36 of 51 submissions, 71%

Cited By

  • (2022) Unification of Source-Code Re-Use Similarity Measures. Advances in Computational Intelligence, pages 397-409. DOI: 10.1007/978-3-031-19493-1_31. Online publication date: 23-Oct-2022
  • (2021) An intelligent decision support system for software plagiarism detection in academia. International Journal of Intelligent Systems. DOI: 10.1002/int.22399. Online publication date: 21-Feb-2021
  • (2019) CORES. ACM Transactions on Storage, 15(3):1-46. DOI: 10.1145/3321704. Online publication date: 26-Jun-2019
  • (2019) Source-code Similarity Detection and Detection Tools Used in Academia. ACM Transactions on Computing Education, 19(3):1-37. DOI: 10.1145/3313290. Online publication date: 21-May-2019
  • (2019) Incorporating Computing Professionals’ Know-how. ACM Transactions on Computing Education, 19(3):1-18. DOI: 10.1145/3309157. Online publication date: 21-May-2019
  • (2019) Computer Science Pedagogical Content Knowledge. ACM Transactions on Computing Education, 19(3):1-24. DOI: 10.1145/3303770. Online publication date: 21-May-2019
  • (2019) A Framework for Teaching Security Design Analysis Using Case Studies and the Hybrid Flipped Classroom. ACM Transactions on Computing Education, 19(3):1-19. DOI: 10.1145/3289238. Online publication date: 16-Jan-2019
  • (2019) Equitable Learning Environments in K-12 Computing. ACM Transactions on Computing Education, 19(3):1-16. DOI: 10.1145/3282939. Online publication date: 30-Jan-2019
  • (2019) Learning IS Child’s Play. ACM Transactions on Computing Education, 19(3):1-18. DOI: 10.1145/3282844. Online publication date: 16-Jan-2019
  • (2019) Does Computer Game Design and Programming Benefit Children? A Meta-Synthesis of Research. ACM Transactions on Computing Education, 19(3):1-35. DOI: 10.1145/3277565. Online publication date: 16-Jan-2019
