ABSTRACT
In this paper, we propose a malware categorization method that uses PageRank to model malware behavior at the instruction level. PageRank was designed to rank web pages from the link structure of the web; applied to malware analysis, it can likewise rank instructions according to the structural relationships among them. Our method uses the computed ranks as features for machine learning algorithms. In the evaluation, we compare the effectiveness of different PageRank algorithms and investigate bagging and boosting algorithms as ways to improve categorization accuracy.
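The idea of turning instruction ranks into features can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not say how the instruction graph is built, so the sketch assumes a directed graph whose nodes are instruction mnemonics and whose edge weights count how often one mnemonic immediately follows another in a trace. The function name `pagerank_features`, the damping factor of 0.85, and the toy trace are all illustrative assumptions.

```python
from collections import defaultdict


def pagerank_features(trace, vocabulary, damping=0.85, iterations=100, tol=1e-10):
    """Return one PageRank score per mnemonic in `vocabulary`.

    Assumption (not from the paper): an edge a -> b is added each time
    mnemonic `a` is immediately followed by mnemonic `b` in `trace`.
    """
    index = {mnemonic: i for i, mnemonic in enumerate(vocabulary)}
    n = len(vocabulary)

    # Weighted adjacency: out_edges[src][dst] = number of observed transitions.
    out_edges = defaultdict(lambda: defaultdict(float))
    for a, b in zip(trace, trace[1:]):
        if a in index and b in index:
            out_edges[index[a]][index[b]] += 1.0

    rank = [1.0 / n] * n
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[i] for i in range(n) if i not in out_edges)
        new_rank = [(1.0 - damping) / n + damping * dangling / n] * n
        for src, targets in out_edges.items():
            total = sum(targets.values())
            for dst, weight in targets.items():
                new_rank[dst] += damping * rank[src] * weight / total
        delta = sum(abs(x - y) for x, y in zip(new_rank, rank))
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy trace for illustration only; real traces would come from disassembly
# or dynamic instrumentation of a sample.
vocabulary = ["push", "mov", "call", "ret"]
trace = ["push", "mov", "call", "mov", "call", "ret"]
features = pagerank_features(trace, vocabulary)
```

The resulting list has a fixed length and order given by `vocabulary`, so vectors from different samples can be stacked into a feature matrix and passed to any classifier, including bagging or boosting ensembles.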