ABSTRACT
Studies have shown that substantial code reuse is common in both open source and commercial projects. However, the precise extent of reuse and its impact on productivity and quality have not been well investigated in the open source context. Previously, we introduced a simple-to-use method that needs only a set of file pathnames to identify directories that share filenames, and we partially validated its performance on a set of closed-source projects. To evaluate this method and to improve reuse detection at the file level, we apply it to the FreeBSD project together with four additional file copy detection methods that use the underlying content of multiple versions of the source code. The evaluation quantified the unique advantages of each method and showed that the filename method detected roughly half of all reuse cases. Scaling the content-based methods to large repositories that contain all versions of open source files remains a challenge.
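The idea behind the filename method can be illustrated with a minimal sketch. It assumes that two directories are candidates for reuse when they share enough file basenames; the function name, the two thresholds (`min_shared`, `min_fraction`), and the overlap measure are hypothetical choices made for illustration and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations
import posixpath


def directories_sharing_filenames(pathnames, min_shared=2, min_fraction=0.5):
    """Return directory pairs that share file basenames.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    # Group file basenames by their containing directory.
    names_by_dir = defaultdict(set)
    for path in pathnames:
        directory, filename = posixpath.split(path)
        if filename:
            names_by_dir[directory].add(filename)

    pairs = []
    # Compare every pair of directories (quadratic; fine for a sketch).
    for dir_a, dir_b in combinations(sorted(names_by_dir), 2):
        shared = names_by_dir[dir_a] & names_by_dir[dir_b]
        smaller = min(len(names_by_dir[dir_a]), len(names_by_dir[dir_b]))
        # Fraction of the smaller directory's filenames found in the other.
        fraction = len(shared) / smaller
        if len(shared) >= min_shared and fraction >= min_fraction:
            pairs.append((dir_a, dir_b, sorted(shared)))
    return pairs
```

Given only a list of pathnames, for example from a repository listing, the function returns tuples of the form `(directory_a, directory_b, shared_filenames)`. No file content is read, which is what makes this kind of approach cheap compared with content-based detection, and also why it can miss copies that were renamed.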