skip to main content
10.1145/3097983.3098111acmconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance

Published:04 August 2017Publication History

ABSTRACT

The Normalized Compression Distance (NCD) has been used in a number of domains to compare objects with varying feature types. This flexibility comes from the use of general purpose compression algorithms as the means of computing distances between byte sequences. Such flexibility makes NCD particularly attractive for cases where the right features to use are not obvious, such as malware classification. However, NCD can be computationally demanding, thereby restricting the scale at which it can be applied. We introduce an alternative metric also inspired by compression, the Lempel-Ziv Jaccard Distance (LZJD). We show that this new distance has desirable theoretical properties, as well as comparable or superior performance for malware classification, while being easy to implement and orders of magnitude faster in practice.

References

  1. Nadia Alshahwan, Earl T Barr, David Clark, and George Danezis 2015. Detecting Malware with Information Complexity. (2 2015). showURL%http://arxiv.org/abs/1502.07661Google ScholarGoogle Scholar
  2. Daniel Arp, Michael Spreitzenbarth, Hubner Malte, Hugo Gascon, and Konrad Rieck. 2014. Drebin: Effective and Explainable Detection of Android Malware in Your Pocket. Symposium on Network and Distributed System Security (NDSS) February (2014), 23--26. https://doi.org/10.14722/ndss.2014.23247Google ScholarGoogle Scholar
  3. Michael Bailey, Jon Oberheide, Jon Andersen, Z Morley Mao, Farnam Jahanian, and Jose Nazario. 2007. Automated Classification and Analysis of Internet Malware Proceedings of the 10th International Conference on Recent Advances in Intrusion Detection (RAID'07). Springer-Verlag, Berlin, Heidelberg, 178--197. http://dl.acm.org/citation.cfm?id=1776434.1776449Google ScholarGoogle Scholar
  4. Rebecca Schuller Borbely. 2015. On normalized compression distance and large malware. Journal of Computer Virology and Hacking Techniques (2015), 1--8. 1007/978-3-642-04342-0_7Google ScholarGoogle Scholar
  5. Nicholas Tran. 2007. The normalized compression distance and image distinguishability Proc. SPIE 6492, Human Vision and Electronic Imaging XII, Bernice E. Rogowitz, Thrasyvoulos N. Pappas, and Scott J. Daly (Eds.), Vol. Vol. 64921D. https://doi.org/10.1117/12.704334Google ScholarGoogle Scholar
  6. Stephanie Wehner. 2007. Analyzing Worms and Network Traffic Using Compression. J. Comput. Secur., Vol. 15, 3 (8 2007), 303--320. ISSN0926-227X http://dl.acm.org/citation.cfm?id=1370628.1370630Google ScholarGoogle Scholar
  7. Wing Wong and Mark Stamp 2006. Hunting for metamorphic engines. Journal in Computer Virology Vol. 2, 3 (2006), 211--229. ISSN1772-9904 https://doi.org/10.1007/s11416-006-0028-7Google ScholarGoogle ScholarCross RefCross Ref
  8. Wei Yan, Zheng Zhang, and Nirwan Ansari 2008. Revealing Packed Malware. IEEE Security and Privacy Vol. 6, 5 (9 2008), 65--69. ISSN1540-7993 https://doi.org/10.1109/MSP.2008.126Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Jacob Ziv and Abraham Lempel 1977. A universal algorithm for sequential data compression. IEEE Transactions on Information Theory Vol. 23, 3 (5 1977), 337--343. ISSN0018--9448 https://doi.org/10.1109/TIT.1977.1055714Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Jacob Ziv and Abraham Lempel 1978. Compression of individual sequences via variable-rate coding. IEEE Transactions on Information Theory Vol. 24, 5 (9 1978), 530--536. ISSN0018-9448 https://doi.org/10.1109/TIT.1978.1055934Google ScholarGoogle Scholar

Index Terms

  1. An Alternative to NCD for Large Sequences, Lempel-Ziv Jaccard Distance

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
          August 2017
          2240 pages
          ISBN:9781450348874
          DOI:10.1145/3097983

          Copyright © 2017 ACM

          Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 4 August 2017

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          KDD '17 Paper Acceptance Rate64of748submissions,9%Overall Acceptance Rate1,133of8,635submissions,13%

          Upcoming Conference

          KDD '24

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader