ABSTRACT
Measuring code similarity is fundamental for many software engineering tasks, e.g., code search, refactoring and reuse. However, most existing techniques focus on code syntactical similarity only, while measuring code functional similarity remains a challenging problem. In this paper, we propose a novel approach that encodes code control flow and data flow into a semantic matrix in which each element is a high dimensional sparse binary feature vector, and we design a new deep learning model that measures code functional similarity based on this representation. By concatenating hidden representations learned from a code pair, this new model transforms the problem of detecting functionally similar code to binary classification, which can effectively learn patterns between functionally similar code with very different syntactics.
We have implemented our approach, DeepSim, for Java programs and evaluated its recall, precision and time performance on two large datasets of functionally similar code. The experimental results show that DeepSim significantly outperforms existing state-of-the-art techniques, such as DECKARD, RtvNN, CDLH, and two baseline deep neural networks models.
- 2006. T.J. Watson Libraries for Analysis (WALA). http://wala.sourceforge.net/ wiki/index.php/Main_Page/. Accessed: 2016-09-02. 2016. Google Code Jam. https://code.google.com/codejam/contests.html. Accessed: 2016-10-08. 2016. TensorFlow: An open-source software library for Machine Intelligence. https://www.tensorflow.org/. Accessed: 2017-08-02. 2017. Deckard Github repo. https://github.com/skyhover/Deckard. Accessed: May/2017.Google Scholar
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).Google Scholar
- Brenda S Baker. 1993. A program for identifying duplicated code. Computing Science and Statistics (1993), 49–49.Google Scholar
- Brenda S Baker. 1996. Parameterized pattern matching: Algorithms and Applications. J. Comput. System Sci. 52, 1 (1996), 28–42. Google ScholarDigital Library
- Matej Balog, Alexander L Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. 2016. Deepcoder: Learning to write programs. arXiv preprint arXiv:1611.01989 (2016).Google Scholar
- Earl T Barr, Yuriy Brun, Premkumar Devanbu, Mark Harman, and Federica Sarro. 2014. The plastic surgery hypothesis. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering. ACM, 306–317. Google ScholarDigital Library
- Ira D Baxter, Andrew Yahin, Leonardo Moura, Marcelo Sant’Anna, and Lorraine Bier. 1998. Clone detection using abstract syntax trees. In Software Maintenance, 1998. Proceedings., International Conference on. IEEE, 368–377. DeepSim: Deep Learning Code Functional Similarity ESEC/FSE ’18, November 4–9, 2018, Lake Buena Vista, FL, USA Google ScholarDigital Library
- Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin. 2003. A neural probabilistic language model. Journal of Machine Learning Research 3, Feb (2003), 1137–1155. Google Scholar
- S Carter, RJ Frank, and DSW Tansley. 1993. Clone detection in telecommunications software systems: A neural net approach. In Proc. Int. Workshop on Application of Neural Networks to Telecommunications. 273–287.Google Scholar
- Kai Chen, Peng Liu, and Yingjun Zhang. 2014. Achieving accuracy and scalability simultaneously in detecting application clones on android markets. In Proceedings of the 36th International Conference on Software Engineering. ACM, 175–186. Google ScholarDigital Library
- Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015).Google Scholar
- Jonathan Crussell, Clint Gibler, and Hao Chen. 2012. Attack of the clones: Detecting cloned applications on android markets. In European Symposium on Research in Computer Security. Springer, 37–54.Google ScholarCross Ref
- Florian Deissenboeck, Lars Heinemann, Benjamin Hummel, and Stefan Wagner. 2012. Challenges of the dynamic detection of functionally similar code fragments. In Software Maintenance and Reengineering (CSMR), 2012 16th European Conference on. IEEE, 299–308. Google ScholarDigital Library
- Mark Gabel, Lingxiao Jiang, and Zhendong Su. 2008. Scalable detection of semantic clones. In 2008 ACM/IEEE 30th International Conference on Software Engineering. IEEE, 321–330. Google ScholarDigital Library
- Mark Gabel and Zhendong Su. 2010. A study of the uniqueness of source code. In Proceedings of the eighteenth ACM SIGSOFT international symposium on Foundations of software engineering. ACM, 147–156. Google ScholarDigital Library
- Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567 (2014).Google Scholar
- Yoshiki Higo, Toshihiro Kamiya, Shinji Kusumoto, Katsuro Inoue, and K Words. 2004. ARIES: Refactoring support environment based on code clone analysis.. In IASTED Conf. on Software Engineering and Applications. 222–229.Google Scholar
- Geoffrey E Hinton and Ruslan R Salakhutdinov. 2006. Reducing the dimensionality of data with neural networks. Science 313, 5786 (2006), 504–507.Google Scholar
- Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780. Google ScholarDigital Library
- Reid Holmes and Gail C Murphy. 2005. Using structural context to recommend source code examples. In Proceedings of the 27th International Conference on Software Engineering. ACM, 117–125. Google ScholarDigital Library
- K. Hornik, M. Stinchcombe, and H. White. 1989. Multilayer Feedforward Networks Are Universal Approximators. Neural Netw. 2, 5 (July 1989), 359–366. Google ScholarDigital Library
- Benjamin Hummel, Elmar Juergens, Lars Heinemann, and Michael Conradt. 2010. Index-based code clone detection: incremental, distributed, scalable. In Software Maintenance (ICSM), 2010 IEEE International Conference on. IEEE, 1–9. Google ScholarDigital Library
- Lingxiao Jiang, Ghassan Misherghi, Zhendong Su, and Stephane Glondu. 2007. Deckard: Scalable and accurate tree-based detection of code clones. In Proceedings of the 29th International Conference on Software Engineering. IEEE Computer Society, 96–105. Google ScholarDigital Library
- Lingxiao Jiang, Zhendong Su, and Edwin Chiu. 2007. Context-based detection of clone-related bugs. In Proceedings of the 6th Joint Meeting of the European Software Engineering Conference and the ACM SIGSOFT Symposium on the Foundations of Software Engineering. ACM, 55–64. Google ScholarDigital Library
- Toshihiro Kamiya, Shinji Kusumoto, and Katsuro Inoue. 2002. CCFinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Transactions on Software Engineering 28, 7 (2002), 654–670. Google ScholarDigital Library
- Yalin Ke, Kathryn T Stolee, Claire Le Goues, and Yuriy Brun. 2015. Repairing programs with semantic code search. In Automated Software Engineering (ASE), 2015 30th IEEE/ACM International Conference on. IEEE, 295–306.Google ScholarDigital Library
- Iman Keivanloo, Juergen Rilling, and Ying Zou. 2014. Spotting working code examples. In Proceedings of the 36th International Conference on Software Engineering. ACM, 664–675. Google ScholarDigital Library
- Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).Google Scholar
- Jens Krinke. 2001. Identifying similar code with program dependence graphs. In Reverse Engineering, 2001. Proceedings. Eighth Working Conference on. IEEE, 301–309. Google ScholarDigital Library
- Shuvendu K Lahiri, Chris Hawblitzel, Ming Kawaguchi, and Henrique Rebêlo. 2012. Symdiff: A language-agnostic semantic diff tool for imperative programs. In International Conference on Computer Aided Verification. Springer, 712–717. Google ScholarDigital Library
- Yann LeCun, Yoshua Bengio, et al. 1995. Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361, 10 (1995), 1995. Google ScholarDigital Library
- Liuqing Li, He Feng, Wenjie Zhuang, Na Meng, and Barbara Ryder. 2017. CCLearner: A Deep Learning-Based Clone Detection Approach. In Software Maintenance and Evolution (ICSME), 2017 IEEE International Conference on. IEEE, 249–260.Google ScholarCross Ref
- Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. 2004. CP-Miner: A Tool for Finding Copy-paste and Related Bugs in Operating System Code. In OSDI, Vol. 4. 289–302. Google ScholarDigital Library
- Chao Liu, Chen Chen, Jiawei Han, and Philip S Yu. 2006. GPLAG: detection of software plagiarism by program dependence graph analysis. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 872–881. Google ScholarDigital Library
- Flemming Nielson, Hanne R Nielson, and Chris Hankin. 2015. Principles of program analysis. Springer. Google ScholarDigital Library
- Gabriele Paolacci, Jesse Chandler, and Panagiotis G Ipeirotis. 2010. Running experiments on amazon mechanical turk. (2010).Google Scholar
- Nimrod Partush and Eran Yahav. 2014. Abstract semantic differencing via speculative correlation. ACM SIGPLAN Notices 49, 10 (2014), 811–828. Google ScholarDigital Library
- Steven P Reiss. 2009. Semantics-based code search. In Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 243– 253. Google ScholarDigital Library
- Chanchal Kumar Roy and James R Cordy. 2007. A survey on software clone detection research. QueenâĂŹs School of Computing TR 541, 115 (2007), 64–68.Google Scholar
- Chanchal K Roy and James R Cordy. 2008. NICAD: Accurate detection of nearmiss intentional clones using flexible pretty-printing and code normalization. In Program Comprehension, 2008. ICPC 2008. The 16th IEEE International Conference on. IEEE, 172–181. Google ScholarDigital Library
- Chanchal K Roy, James R Cordy, and Rainer Koschke. 2009. Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming 74, 7 (2009), 470–495. Google ScholarDigital Library
- David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. 1988. Learning representations by back-propagating errors. Cognitive modeling 5, 3 (1988), 1.Google Scholar
- Hitesh Sajnani, Vaibhav Saini, and Cristina Lopes. 2015. A parallel and efficient approach to large scale clone detection. Journal of Software: Evolution and Process 27, 6 (2015), 402–429. Google ScholarDigital Library
- Hitesh Sajnani, Vaibhav Saini, Jeffrey Svajlenko, Chanchal K Roy, and Cristina V Lopes. 2016. SourcererCC: scaling code clone detection to big-code. In Proceedings of the 38th International Conference on Software Engineering. ACM, 1157–1168. Google ScholarDigital Library
- Eui Chul Richard Shin, Dawn Song, and Reza Moazzezi. 2015. Recognizing Functions in Binaries with Neural Network. In USENIX Security Symposium. 611–626. Google ScholarDigital Library
- Richard Socher, Cliff C Lin, Chris Manning, and Andrew Y Ng. 2011. Parsing natural scenes and natural language with recursive neural networks. In Proceedings of the 28th international conference on machine learning (ICML-11). 129–136. Google ScholarDigital Library
- Kathryn T Stolee, Sebastian Elbaum, and Daniel Dobos. 2014. Solving the search for source code. ACM Transactions on Software Engineering and Methodology (TOSEM) 23, 3 (2014), 26. Google ScholarDigital Library
- Fang-Hsiang Su, Jonathan Bell, Gail Kaiser, and Simha Sethumadhavan. 2016. Identifying functionally similar code in complex codebases. In Program Comprehension (ICPC), 2016 IEEE 24th International Conference on. IEEE, 1–10.Google Scholar
- Fang-hsiang Su, Kenneth Harvey, Simha Sethumadhavan, Gail E Kaiser, and Tony Jebara. 2015. Code Relatives: Detecting Similar Software Behavior. (2015).Google Scholar
- Jeffrey Svajlenko, Judith F Islam, Iman Keivanloo, Chanchal Kumar Roy, and Mohammad Mamun Mia. 2014. Towards a Big Data Curated Benchmark of Inter-project Code Clones.. In ICSME. 476–480. Google ScholarDigital Library
- Jeffrey Svajlenko, Iman Keivanloo, and Chanchal K Roy. 2013. Scaling classical clone detection tools for ultra-large datasets: An exploratory study. In Proceedings of the 7th International Workshop on Software Clones. IEEE Press, 16–22. Google ScholarDigital Library
- Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning. ACM, 1096–1103. Google ScholarDigital Library
- Stefan Wagner, Asim Abdulkhaleq, Ivan Bogicevic, Jan-Peter Ostberg, and Jasmin Ramadani. 2016. How are functionally similar code clones syntactically different? An empirical study and a benchmark. PeerJ Computer Science 2 (2016), e49.Google ScholarCross Ref
- Hui-Hui Wei and Ming Li. 2017. Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 3034–3040. Google ScholarDigital Library
- Martin White, Michele Tufano, Christopher Vendome, and Denys Poshyvanyk. 2016. Deep learning code fragments for code clone detection. In Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering. ACM, 87–98. Google ScholarDigital Library
- Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).Google Scholar
- Jiachen Yang, Keisuke Hotta, Yoshiki Higo, Hiroshi Igaki, and Shinji Kusumoto. 2015. Classification model for code clones based on machine learning. Empirical Software Engineering 20, 4 (2015), 1095–1125. Google ScholarDigital Library
Index Terms
- DeepSim: deep learning code functional similarity
Recommendations
Deep learning similarities from different representations of source code
MSR '18: Proceedings of the 15th International Conference on Mining Software RepositoriesAssessing the similarity between code components plays a pivotal role in a number of Software Engineering (SE) tasks, such as clone detection, impact analysis, refactoring, etc. Code similarity is generally measured by relying on manually defined or ...
The Scent of Deep Learning Code: An Empirical Study
MSR '20: Proceedings of the 17th International Conference on Mining Software RepositoriesDeep learning practitioners are often interested in improving their model accuracy rather than the interpretability of their models. As a result, deep learning applications are inherently complex in their structures. They also need to continuously ...
Deep learning application on code clone detection: A review of current knowledge
AbstractBad smells in code are indications of low code quality representing potential threats to the maintainability and reusability of software. Code clone is a type of bad smells caused by code fragments that have the same functional ...
Highlights- Deep neural and recurrent neural networks are used most in detecting code clones.
Comments