Abstract
Code authorship attribution is the process of identifying the author of a given piece of code. As malware grows in volume and mutation techniques become more advanced, malware authors are producing large numbers of variants. Dealing with this problem requires methods for examining the authorship of malicious code. Code authorship attribution techniques can be used to identify and categorize the authors of malware. This information can help predict the types of tools and techniques that the author of a specific malware sample uses, as well as the manner in which the malware spreads and evolves. In this article, we present the first comprehensive review of research on code authorship attribution. The article summarizes various methods of authorship attribution and highlights challenges in the field.
Index Terms
- Code Authorship Attribution: Methods and Challenges