ABSTRACT
Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.
- Y. Bachrach, E. Porat, and J. S. Rosenschein. Sketching techniques for collaborative filtering. In IJCAI, pages 2016--2021, 2009. Google ScholarDigital Library
- Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, pages 1--10. Springer, 2002. Google ScholarDigital Library
- M. Bendersky and W. B. Croft. Finding text reuse on the web. In WSDM, pages 262--271, 2009. Google ScholarDigital Library
- A. Z. Broder. On the resemblance and containment of documents. Sequences, pages 21--29, 1997. Google ScholarDigital Library
- A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarDigital Library
- A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8--13):1157--1166, 1997. Google ScholarDigital Library
- F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219--228, 2009. Google ScholarDigital Library
- E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng., 13(1):64--78, 2001. Google ScholarDigital Library
- A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007. Google ScholarDigital Library
- S. Gollapudi and R. Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In CIKM, pages 475--482, 2006. Google ScholarDigital Library
- S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW, pages 381--390, 2009. Google ScholarDigital Library
- M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284--291, 2006. Google ScholarDigital Library
- S. Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, pages 246--255, 2010. Google ScholarDigital Library
- K. Kutzkov, A. Bifet, F. Bonchi, and A. Gionis. Strip: stream learning of influence probabilities. In KDD, pages 275--283, 2013. Google ScholarDigital Library
- P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671--680, 2010. Google ScholarDigital Library
- P. Li, A. Shrivastava, J. L. Moore, and A. C. König. Hashing algorithms for large-scale learning. In NIPS, pages 2672--2680, 2011.Google ScholarDigital Library
- M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. MSR-TR-2010--73 technical report, 2010.Google Scholar
- G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In WWW, pages 141--150, 2007. Google ScholarDigital Library
- M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005. Google ScholarDigital Library
- E. F. Schuster and A. N. Philippou. The odds in some odd-even games. The American Mathematical Monthly, 82:646--648, 1975.Google ScholarCross Ref
- T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking web spam with html style similarities. TWEB, 2(1), 2008. Google ScholarDigital Library
Index Terms
- Efficient estimation for high similarities using odd sketches
Recommendations
A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningEstimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been ...
Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces
SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrievalEfficient similarity search in high-dimensional spaces is important to content-based retrieval systems. Recent studies have shown that sketches can effectively approximate L1 distance in high-dimensional spaces, and that filtering with sketches can ...
Comments