ABSTRACT
The Normalized Compression Distance (NCD) has been used in a number of domains to compare objects with varying feature types. This flexibility comes from the use of general-purpose compression algorithms as the means of computing distances between byte sequences. Such flexibility makes NCD particularly attractive for cases where the right features to use are not obvious, such as malware classification. However, NCD can be computationally demanding, thereby restricting the scale at which it can be applied. We introduce an alternative metric also inspired by compression, the Lempel-Ziv Jaccard Distance (LZJD). We show that this new distance has desirable theoretical properties, as well as comparable or superior performance for malware classification, while being easy to implement and orders of magnitude faster in practice.
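The two distances named in the abstract can be sketched side by side; this is an added example, not code from the paper. It assumes the standard NCD formula with zlib as the compressor and reads LZJD as the exact Jaccard distance between LZ78-style phrase sets; the function names and sample byte strings are constructed here.

```python
import random
import zlib


def ncd(x: bytes, y: bytes) -> float:
    """NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with zlib as C."""
    cx, cy = len(zlib.compress(x, 9)), len(zlib.compress(y, 9))
    cxy = len(zlib.compress(x + y, 9))
    return (cxy - min(cx, cy)) / max(cx, cy)


def lz_set(data: bytes) -> set:
    """Phrases of an LZ78-style parse: extend the current chunk until it is new."""
    phrases, start = set(), 0
    for end in range(1, len(data) + 1):
        chunk = data[start:end]
        if chunk not in phrases:
            phrases.add(chunk)
            start = end
    return phrases


def lzjd(x: bytes, y: bytes) -> float:
    """Jaccard distance between the two phrase sets (exact, no hashing)."""
    a, b = lz_set(x), lz_set(y)
    if not a and not b:
        return 0.0  # two empty inputs are identical
    return 1.0 - len(a & b) / len(a | b)


def make_samples(seed: int = 7, size: int = 4096):
    """Constructed inputs: a base blob, a near-copy, and an unrelated blob."""
    rng = random.Random(seed)
    blob = lambda n: bytes(rng.randrange(256) for _ in range(n))
    base = blob(size)
    variant = base[:-64] + blob(64)  # same content, last 64 bytes differ
    unrelated = blob(size)
    return base, variant, unrelated


if __name__ == "__main__":
    base, variant, unrelated = make_samples()
    for name, other in (("variant", variant), ("unrelated", unrelated)):
        print(f"{name:9s}  NCD={ncd(base, other):.3f}  LZJD={lzjd(base, other):.3f}")
```

NCD needs three compressor runs per pair, while the phrase set is built once per input and reused for every comparison. With zlib specifically, the 32 KiB DEFLATE window also stops the concatenated run from seeing shared content once inputs outgrow the window.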
Index Terms
- An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance