research-article

Efficient estimation for high similarities using odd sketches

Authors:
Michael Mitzenmacher

Harvard University, Cambridge, MA, USA

Harvard University, Cambridge, MA, USA
View Profile

,
Rasmus Pagh

IT University of Copenhagen, Copenhagen, Denmark

IT University of Copenhagen, Copenhagen, Denmark
View Profile

,
Ninh Pham

IT University of Copenhagen, Copenhagen, Denmark

IT University of Copenhagen, Copenhagen, Denmark
View Profile

WWW '14: Proceedings of the 23rd international conference on World wide webApril 2014Pages 109–118https://doi.org/10.1145/2566486.2568017

Published:07 April 2014Publication History

WWW '14: Proceedings of the 23rd international conference on World wide web

Pages 109–118

ABSTRACT

Estimating set similarity is a central problem in many computer applications. In this paper we introduce the Odd Sketch, a compact binary sketch for estimating the Jaccard similarity of two sets. The exclusive-or of two sketches equals the sketch of the symmetric difference of the two sets. This means that Odd Sketches provide a highly space-efficient estimator for sets of high similarity, which is relevant in applications such as web duplicate detection, collaborative filtering, and association rule learning. The method extends to weighted Jaccard similarity, relevant e.g. for TF-IDF vector comparison. We present a theoretical analysis of the quality of estimation to guarantee the reliability of Odd Sketch-based estimators. Our experiments confirm this efficiency, and demonstrate the efficiency of Odd Sketches in comparison with $b$-bit minwise hashing schemes on association rule learning and web duplicate detection tasks.

References

Y. Bachrach, E. Porat, and J. S. Rosenschein. Sketching techniques for collaborative filtering. In IJCAI, pages 2016--2021, 2009. Google ScholarDigital Library
Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, pages 1--10. Springer, 2002. Google ScholarDigital Library
M. Bendersky and W. B. Croft. Finding text reuse on the web. In WSDM, pages 262--271, 2009. Google ScholarDigital Library
A. Z. Broder. On the resemblance and containment of documents. Sequences, pages 21--29, 1997. Google ScholarDigital Library
A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60(3):630--659, 2000. Google ScholarDigital Library
A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. Computer Networks, 29(8--13):1157--1166, 1997. Google ScholarDigital Library
F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219--228, 2009. Google ScholarDigital Library
E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman, and C. Yang. Finding interesting associations without support pruning. IEEE Trans. Knowl. Data Eng., 13(1):64--78, 2001. Google ScholarDigital Library
A. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In WWW, pages 271--280, 2007. Google ScholarDigital Library
S. Gollapudi and R. Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In CIKM, pages 475--482, 2006. Google ScholarDigital Library
S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW, pages 381--390, 2009. Google ScholarDigital Library
M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284--291, 2006. Google ScholarDigital Library
S. Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In ICDM, pages 246--255, 2010. Google ScholarDigital Library
K. Kutzkov, A. Bifet, F. Bonchi, and A. Gionis. Strip: stream learning of influence probabilities. In KDD, pages 275--283, 2013. Google ScholarDigital Library
P. Li and A. C. König. b-bit minwise hashing. In WWW, pages 671--680, 2010. Google ScholarDigital Library
P. Li, A. Shrivastava, J. L. Moore, and A. C. König. Hashing algorithms for large-scale learning. In NIPS, pages 2672--2680, 2011.Google ScholarDigital Library
M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. MSR-TR-2010--73 technical report, 2010.Google Scholar
G. S. Manku, A. Jain, and A. D. Sarma. Detecting near-duplicates for web crawling. In WWW, pages 141--150, 2007. Google ScholarDigital Library
M. Mitzenmacher and E. Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005. Google ScholarDigital Library
E. F. Schuster and A. N. Philippou. The odds in some odd-even games. The American Mathematical Monthly, 82:646--648, 1975.Google ScholarCross Ref
T. Urvoy, E. Chauveau, P. Filoche, and T. Lavergne. Tracking web spam with html style similarities. TWEB, 2(1), 2008. Google ScholarDigital Library

Index Terms

Efficient estimation for high similarities using odd sketches
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets
KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Estimating set similarity and detecting highly similar sets are fundamental problems in areas such as databases, machine learning, and information retrieval. MinHash is a well-known technique for approximating Jaccard similarity of sets and has been ...
Read More
Architectural Drawing Using Pencil Sketches and AutoCAD
Read More
Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces
SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Efficient similarity search in high-dimensional spaces is important to content-based retrieval systems. Recent studies have shown that sketches can effectively approximate L₁ distance in high-dimensional spaces, and that filtering with sketches can ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
WWW '14: Proceedings of the 23rd international conference on World wide web
April 2014
926 pages
ISBN:9781450327442
DOI:10.1145/2566486
General Chair:
Chin-Wan Chung
Korea Advanced Institute of Science and Technology, Korea
,
Program Chairs:
Andrei Broder
Google Inc., USA
,
Kyuseok Shim
Seoul National University, Korea
,
Torsten Suel
New York University, USA
Copyright © 2014 Copyright is held by the International World Wide Web Conference Committee (IW3C2).
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 7 April 2014
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
bloom filters
minwise hashing
set similarity
Qualifiers
- research-article
Conference

Acceptance Rates
WWW '14 Paper Acceptance Rate84of645submissions,13%Overall Acceptance Rate1,899of8,196submissions,23%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 44
  Total Citations
  View Citations
- 755
  Total Downloads
- Downloads (Last 12 months)41
- Downloads (Last 6 weeks)6
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient estimation for high similarities using odd sketches

WWW '14: Proceedings of the 23rd international conference on World wide web

ABSTRACT

References

Cited By

Index Terms

Recommendations

A Memory-Efficient Sketch Method for Estimating High Similarities in Streaming Sets

Architectural Drawing Using Pencil Sketches and AutoCAD

Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces