skip to main content
10.1145/1376916.1376946acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

The power of two min-hashes for similarity search among hierarchical data objects

Published: 09 June 2008 Publication History

Abstract

In this study we propose sketching algorithms for computing similarities between hierarchical data. Specifically, we look at data objects that are represented using leaf-labeled trees denoting a set of elements at the leaves organized in a hierarchy. Such representations are richer alternatives to a set. For example, a document can be represented as a hierarchy of sets wherein chapters, sections, and paragraphs represent different levels in the hierarchy. Such a representation is richer than viewing the document simply as a set of words. We measure distance between trees using the best possible super-imposition that minimizes the number of mismatched leaf labels. Our distance measure is equivalent to an Earth Mover's Distance measure since the leaf-labeled trees of height one can be viewed as sets and can be recursively extended to trees of larger height by viewing them as set of sets. We compute sketches of arbitrary weighted trees and analyze them in the context of locality-sensitive hashing (LSH) where the probability of two sketches matching is high when two trees are similar and low when the two trees are far under the given distance measure. Specifically, we compute sketches of such trees by propagating min-hash computations up the tree. Furthermore, we show that propagating one min-hash results in poor sketch properties while propagating two min-hashes results in good sketches.

References

[1]
N. Augsten, M. Böhlen, C. Dyreson, and J. Gamper. Approximate joins for data-centric XML. In Proceedings of the International Conference on Data Engineering (ICDE), Cancún, Mexico, April 2008. To appear.
[2]
N. Augsten, M. Böhlen, and J. Gamper. Approximate matching of hierarchical data using pq-grams. In Proc. of the 31st VLDB Conference, pages 301--312, 2005.
[3]
Andrei Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES SEQS: Sequences '91, 1998.
[4]
Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000.
[5]
Moses Charikar. Similarity estimation techniques from rounding algorithms. In Proc. 34th Annual ACM Symposium on Theory of Computing, pages 380--388, 2002.
[6]
S. S. Chawathe, A. Rajaraman, H. Garcia-Molina, and J. Widom. Change detection in hierarchically structured information. In Proc. of the ACM SIGMOD Intl. Conf. on Management of Data, pages 493--504, 1996.
[7]
W. Chen. New algorithms for ordered tree-to-tree correction problem. Journal of Algorithms, 40(2):135--158, August 2001.
[8]
Mayur Datar, Nicole Immorlica, Piotr Indyk, and Vahab S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In Proc. of the 20th ACM Symposium on Computational Geometry, pages 253--262, 2004.
[9]
M. Garofalakis and A. Kumar. Xml stream processing using tree-edit distance embeddings. ACM Transactions on Database Systems, 30(1):279--332, 2005.
[10]
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. of 25th International Conference on Very Large Data Bases, VLDB, pages 518--529, 1999.
[11]
S. Gollapudi and R. Panigrahy. Exploiting asymmetry in hierarchical topic extraction. In Proc. of 13th Conference on Information and Knowledge Management, 2006.
[12]
Taher H. Haveliwala, Aristides Gionis, and Piotr Indyk. Scalable techniques for clustering the web. In WebDB (Informal Proceedings), pages 129--134, 2000.
[13]
Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of 30th ACM Symposium on Theory of Computing (STOC), pages 604--613, 1998.
[14]
T. Jiang, L. Wang, and K. Zhang. Alignment of trees - an alternative to tree edit. In Proc. Intl Conference on Combinatorial Pattern Matching, pages 75--86, 1994.
[15]
K. Kailing, H-P. Kriegel, S. Schonauer, and T. Seidl. Efficient similarity search for hierarchical data in large databases. In Proc. 9th Intl Conference on Extending Database Technology, pages 676--693, 2004.
[16]
T. Margush and F. R. McMorris. Consensus n-trees. Bulletin of Mathematical Biology, 3:239--244, 1981.
[17]
Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, pages 218--224, 2005.
[18]
Rina Panigrahy. Entropy based nearest neighbor search in high dimensions. In Proc. of 17th Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2006, pages 1186--1195, 2006.
[19]
K. Zhang. A constrained editing distance between unordered labeled trees. Algorithmica, 15:205--222, 1996.
[20]
K. Zhang and D. Shasha. Simple fast algorithms for the editing distance between trees and related problems. SIAM Journal on Computing, 18(6):1245--1262, 1989.
[21]
Li Zhang. On matching nodes between trees. Technical Report HPL-2003-67, HP Laboratories, Palo Alto, CA, April 2003.

Cited By

View all
  • (2024)Fast Comparative Analysis of Merge Trees Using Locality Sensitive HashingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.345638331:1(141-151)Online publication date: 12-Sep-2024
  • (2018)Hashing for Adaptive Real-Time Graph Stream Classification With Concept DriftsIEEE Transactions on Cybernetics10.1109/TCYB.2017.270897948:5(1591-1604)Online publication date: May-2018
  • (2018)Earth Mover’s Distance Between Rooted Labeled Unordered Trees Formulated from Complete SubtreesPattern Recognition Applications and Methods10.1007/978-3-030-05499-1_4(65-88)Online publication date: 16-Jan-2018
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2008
330 pages
ISBN:9781605581521
DOI:10.1145/1376916
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 09 June 2008

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. earth movers distance
  2. locality sensitive hashing
  3. similarity

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '08
Sponsor:

Acceptance Rates

PODS '08 Paper Acceptance Rate 28 of 159 submissions, 18%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)0
Reflects downloads up to 07 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Fast Comparative Analysis of Merge Trees Using Locality Sensitive HashingIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.345638331:1(141-151)Online publication date: 12-Sep-2024
  • (2018)Hashing for Adaptive Real-Time Graph Stream Classification With Concept DriftsIEEE Transactions on Cybernetics10.1109/TCYB.2017.270897948:5(1591-1604)Online publication date: May-2018
  • (2018)Earth Mover’s Distance Between Rooted Labeled Unordered Trees Formulated from Complete SubtreesPattern Recognition Applications and Methods10.1007/978-3-030-05499-1_4(65-88)Online publication date: 16-Jan-2018
  • (2013)Stratification driven placement of complex dataProceedings of the 2013 IEEE International Conference on Data Engineering (ICDE 2013)10.1109/ICDE.2013.6544868(709-720)Online publication date: 8-Apr-2013

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media