skip to main content
10.1145/1653662.1653736acmconferencesArticle/Chapter ViewAbstractPublication PagesccsConference Proceedingsconference-collections
research-article

Large-scale malware indexing using function-call graphs

Published: 09 November 2009 Publication History

Abstract

A major challenge of the anti-virus (AV) industry is how to effectively process the huge influx of malware samples they receive every day. One possible solution to this problem is to quickly determine if a new malware sample is similar to any previously-seen malware program. In this paper, we design, implement and evaluate a malware database management system called SMIT (Symantec Malware Indexing Tree) that can efficiently make such determination based on malware's function-call graphs, which is a structural representation known to be less susceptible to instruction-level obfuscations commonly employed by malware writers to evade detection of AV software. Because each malware program is represented as a graph, the problem of searching for the most similar malware program in a database to a given malware sample is cast into a nearest-neighbor search problem in a graph database. To speed up this search, we have developed an efficient method to compute graph similarity that exploits structural and instruction-level information in the underlying malware programs, and a multi-resolution indexing scheme that uses a computationally economical feature vector for early pruning and resorts to a more accurate but computationally more expensive graph similarity function only when it needs to pinpoint the most similar neighbors. Results of a comprehensive performance study of the SMIT prototype using a database of more than 100,000 malware demonstrate the effective pruning power and scalability of its nearest neighbor search mechanisms.

References

[1]
Armadillo. http://www.siliconrealms.com/armadillo.htm, 2008.
[2]
Peid 0.95. http://www.peid.info/, 2008.
[3]
Trid v2.02. http://mark0.net/soft-trid-e.html, 2008.
[4]
M. Bailey, J. Oberheide, J. Andersen, Z. M. Mao, F. Jahanian, and J. Nazario. Automated classification and analysis of internet malware. In RAID, pages 178--197, 2007.
[5]
U. Bayer, P. Milani Comparetti, C. Hlauscheck, C. Kruegel, and E. Kirda. Scalable, Behavior-Based Malware Clustering. In 16th Symposium on Network and Distributed System Security, 2009.
[6]
T. Bozkaya and M. Ozsoyoglu. Distance-based indexing for high-dimensional metric spaces. In In Proc. ACM SIGMOD International Conference on Management of Data, 1997.
[7]
I. Briones and A. Gomez. Graphs, entropy and grid computing: Automatic comparison of malware. In Proceedings of the 2004 Virus Bulletin Conference, 2004.
[8]
E. Carrera and G. Erdelyi. Digital genome mapping al advanced binary malware analysis. In Proceedings of the 2004 Virus Bulletin Conference, 2004.
[9]
T.-c. Chiueh. Content-based image indexing. In VLDB'94: Proceedings of the 20th International Conference on Very Large Data Bases, pages 582--593, 1994.
[10]
S. Das, A. Mistry, D. Negoescu, G. Reed, and S. K. Singh. A graph matching problem. Techical report, IPAM Research in Industrial Projects for Students (RIPS), 2008.
[11]
M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP--Completeness. W. H. Freeman, 1979.
[12]
F. Guo, P. Ferrie, and T.-C. Chiueh. A study of the packer problem and its solutions. In RAID'08, pages 98--115, 2008.
[13]
L. C. Harris and B. P. Miller. Practical analysis of stripped binary code. SIGARCH Comput. Archit. News, 33(5):63--68, 2005.
[14]
H. He and A. K. Singh. Closure-tree: An index structure for graph queries. In ICDE'06: Proceedings of the 22nd International Conference on Data Engineering, page 38, 2006.
[15]
Hex-rays. The IDA Pro Disassembler and Debugger. http://www.hexrays.com/idapro/, 2008.
[16]
X. Hu, T. cker Chiueh, and K. G. Shin. Large-scale malware indexing using function-call graphs (extended). Technical Report, Department of Computer Sicence, University of Michigan, 2009.
[17]
Ilfak Guilfanov. Fast Library Identification and Recognition Technology. http://www.hex-rays.com/idapro/flirt.htm, 1997.
[18]
D. Justice. A binary linear programming formulation of the graph edit distance. IEEE Trans. Pattern Anal. Mach. Intell., 28(8):1200--1214, 2006. Fellow-Hero, Alfred.
[19]
J. Z. Kolter and M. A. Maloof. Learning to detect and classify malicious executables in the wild. J. Mach. Learn. Res., 7:2721--2744, 2006.
[20]
C. Kruegel, E. Kirda, D. Mutz, W. Robertson, and G. Vigna. Polymorphic worm detection using structural information of executables. In In RAID, pages 207--226. Springer-Verlag, 2005.
[21]
H. W. Kuhn. The hungarian method for the assignment problem. Naval Research Logistics Quarterly, 1955.
[22]
L. Martignoni, M. Christodorescu, and S. Jha. Omniunpack: Fast, generic, and safe unpacking of malware. In In Proceedings of the Annual Computer Security Applications Conference (ACSAC, 2007.
[23]
R. Myers, R. C. Wilson, and E. R. Hancock. Bayesian graph edit distance. IEEE Trans. Pattern Anal. Mach. Intell., 22(6), 2000.
[24]
M. Neuhaus and H. Bunke. An error-tolerant approximate matching algorithm for attributed planar graphs and its application to fingerprint classification. In SSPR/SPR, pages 180--189, 2004.
[25]
M. Pietrek. An In-Depth Look into the Win32 PE File Format. http://msdn.microsoft.com/en-us/magazine/cc301805.aspx, 2002.
[26]
J. Raber and E. Laspe. Deobfuscator: An automated approach to the identification and removal of code obfuscation. Reverse Engineering, Working Conference on, 0:275--276, 2007.
[27]
K. Rieck, T. Holz, C. Willems, P. Düssel, and P. Laskov. Learning and classification of malware behavior. In DIMVA '08, pages 108---125, 2008.
[28]
K. Riesen, M. Neuhaus, and H. Bunke. Bipartite graph matching for computing the edit distance of graphs. In Graph-Based Representations in Pattern Recognition, volume 4538, 2007.
[29]
D. Shasha, Jason, and R. Giugno. Algorithmics and applications of tree and graph searching. In Symposium on Principles of Database Systems, pages 39--52, 2002.
[30]
Symantec Corp. Symantec Global Internet Security Threat Report. Volume XII. http://www.symantec.com/, April 2008.
[31]
Y. Tian and J. M. Patel. Tale: A tool for approximate large graph matching. In ICDE, pages 963--972, 2008.
[32]
T.Lee and J.J.Mody. Behavioral classification. http://www.microsoft.com/downloads/details.aspx?FamilyID=7b5d8cc8--b336--4091--abb5--2cc500a6c41a&displaylang=en,2006.
[33]
VMProtect. Vmprotect. http://www.vmprotect.ru/, 2008.
[34]
X. Yan, P. S. Yu, and J. Han. Substructure similarity search in graph databases. In SIGMOD'05: Proceedings of the 2005 ACM SIGMOD international conference on Management of data, 2005.
[35]
P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In SODA: ACM-SIAM Symposium on Discrete Algorithms, 1993.
[36]
P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach. Springer, 2006.
[37]
P. Zezula, P. Ciaccia, and F. Rabitti. M-tree: A dynamic index for similarity queries in multimedia databases. Technical Report 7, HERMES ESPRIT LTR Project, 1996.
[38]
K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM J. Comput.,18(6):1245--1262, 1989.
[39]
P. Zhao, J. X. Yu, and P. S. Yu. Graph indexing: tree + delta ‹ graph. In VLDB'07: Proceedings of the 33rd international conference on Very large data bases, pages 938--949, 2007.

Cited By

View all
  • (2024)Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNNChinese Journal of Electronics10.23919/cje.2022.00.22833:1(128-138)Online publication date: Jan-2024
  • (2024)RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity DetectionProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695070(770-782)Online publication date: 27-Oct-2024
  • (2024)Android Malware Detection Method Based on CNN and DNN Bybrid MechanismIEEE Transactions on Industrial Informatics10.1109/TII.2024.336301620:5(7744-7753)Online publication date: May-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
CCS '09: Proceedings of the 16th ACM conference on Computer and communications security
November 2009
664 pages
ISBN:9781605588940
DOI:10.1145/1653662
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 November 2009

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. graph similarity
  2. malware indexing
  3. multi-resolution indexing

Qualifiers

  • Research-article

Conference

CCS '09
Sponsor:

Acceptance Rates

Overall Acceptance Rate 1,261 of 6,999 submissions, 18%

Upcoming Conference

CCS '25

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)80
  • Downloads (Last 6 weeks)13
Reflects downloads up to 18 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Fast Cross-Platform Binary Code Similarity Detection Framework Based on CFGs Taking Advantage of NLP and Inductive GNNChinese Journal of Electronics10.23919/cje.2022.00.22833:1(128-138)Online publication date: Jan-2024
  • (2024)RCFG2Vec: Considering Long-Distance Dependency for Binary Code Similarity DetectionProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering10.1145/3691620.3695070(770-782)Online publication date: 27-Oct-2024
  • (2024)Android Malware Detection Method Based on CNN and DNN Bybrid MechanismIEEE Transactions on Industrial Informatics10.1109/TII.2024.336301620:5(7744-7753)Online publication date: May-2024
  • (2024)Strtune: Data Dependence-Based Code Slicing for Binary Similarity Detection With Fine-Tuned RepresentationIEEE Transactions on Information Forensics and Security10.1109/TIFS.2024.348494419(10233-10245)Online publication date: 2024
  • (2024)Foundational Models for Malware Embeddings Using Spatio-Temporal Parallel Convolutional Networks2024 International Conference on Computing, Networking and Communications (ICNC)10.1109/ICNC59896.2024.10556354(168-174)Online publication date: 19-Feb-2024
  • (2024)An empirical study of problems and evaluation of IoT malware classification label sourcesJournal of King Saud University - Computer and Information Sciences10.1016/j.jksuci.2023.10189836:1(101898)Online publication date: Jan-2024
  • (2024)Parallelization of butterfly counting on hierarchical memoryThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00856-x33:5(1453-1484)Online publication date: 1-Sep-2024
  • (2023)EMBERSimProceedings of the 37th International Conference on Neural Information Processing Systems10.5555/3666122.3667283(26722-26743)Online publication date: 10-Dec-2023
  • (2023)IoTSim: Internet of Things-Oriented Binary Code Similarity Detection with Multiple Block RelationsSensors10.3390/s2318778923:18(7789)Online publication date: 11-Sep-2023
  • (2023)BlockMatch: A Fine-Grained Binary Code Similarity Detection Approach Using Contrastive Learning for Basic Block MatchingApplied Sciences10.3390/app13231275113:23(12751)Online publication date: 28-Nov-2023
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media