skip to main content
10.1145/2566486.2568017acmotherconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article

Efficient estimation for high similarities using odd sketches

Published:07 April 2014Publication History

ABSTRACT

Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.

References

  1. Y. Bachrach, E. Porat, and J. S. Rosenschein. Sketching techniques for collaborative filtering. In IJCAI, pages 2016--2021, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, pages 1--10. Springer, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Bendersky and W. B. Croft. Finding text reuse on the web. In WSDM, pages 262--271, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Z. Broder. On the resemblance and containment of documents. Sequences, pages 21--29, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8--13):1157--1166, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219--228, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng., 13(1):64--78, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Gollapudi and R. Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In CIKM, pages 475--482, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW, pages 381--390, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284--291, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, pages 246--255, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. K. Kutzkov, A. Bifet, F. Bonchi, and A. Gionis. Strip: stream learning of influence probabilities. In KDD, pages 275--283, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671--680, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. P. Li, A. Shrivastava, J. L. Moore, and A. C. König. Hashing algorithms for large-scale learning. In NIPS, pages 2672--2680, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. MSR-TR-2010--73 technical report, 2010.Google ScholarGoogle Scholar
  18. G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In WWW, pages 141--150, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. E. F. Schuster and A. N. Philippou. The odds in some odd-even games. The American Mathematical Monthly, 82:646--648, 1975.Google ScholarGoogle ScholarCross RefCross Ref
  21. T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking web spam with html style similarities. TWEB, 2(1), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient estimation for high similarities using odd sketches

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      WWW '14: Proceedings of the 23rd international conference on World wide web
      April 2014
      926 pages
      ISBN:9781450327442
      DOI:10.1145/2566486

      Copyright © 2014 Copyright is held by the International World Wide Web Conference Committee (IW3C2).

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 7 April 2014

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      WWW '14 Paper Acceptance Rate84of645submissions,13%Overall Acceptance Rate1,899of8,196submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader