skip to main content
10.1145/2723372.2723733acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Efficient Similarity Join and Search on Multi-Attribute Data

Published:27 May 2015Publication History

ABSTRACT

In this paper we study similarity join and search on multi- attribute data. Traditional methods on single-attribute data have pruning power only on single attributes and cannot efficiently support multi-attribute data. To address this problem, we propose a prefix tree index which has holis- tic pruning ability on multiple attributes. We propose a cost model to quantify the prefix tree which can guide the prefix tree construction. Based on the prefix tree, we devise a filter-verification framework to support similarity search and join on multi-attribute data. The filter step prunes a large number of dissimilar results and identifies some candi- dates using the prefix tree and the verification step verifies the candidates to generate the final answer. For similar- ity join, we prove that constructing an optimal prefix tree is NP-complete and develop a greedy algorithm to achieve high performance. For similarity search, since one prefix tree cannot support all possible search queries, we extend the cost model to support similarity search and devise a budget-based algorithm to construct multiple high-quality prefix trees. We also devise a hybrid verification algorithm to improve the verification step. Experimental results show our method significantly outperforms baseline approaches.

References

  1. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, pages 5--16, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. N. N. Dalvi, V. Rastogi, A. Dasgupta, A. D. Sarma, and T. Sarlós. Optimal hashing schemes for entity matching. In WWW, pages 295--306, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Deng, G. Li, and J. Feng. A pivotal prefix based filtering algorithm for string similarity search. In SIGMOD Conference, pages 673--684, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Deng, G. Li, J. Feng, and W.-S. Li. Top-k string similarity search with edit-distance constraints. In ICDE, pages 925--936, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Deng, G. Li, S. Hao, J. Wang, and J. Feng. Massjoin: A mapreduce-based method for scalable string similarity joins. In ICDE, pages 340--351, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  7. M. Garey and D. Johnson. A guide to the theory of NP-completeness. WH Freeman and Company, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, pages 491--500, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Hadjieleftheriou, N. Koudas, and D. Srivastava. Incremental maintenance of length normalized indexes for approximate string matching. In SIGMOD Conference, pages 429--440, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. M. Hellerstein and M. Stonebraker. Predicate migration: Optimizing queries with expensive predicates. In SIGMOD Conference, pages 267--276, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. A. Hernández and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Min. Knowl. Discov., 2(1):9--37, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Y. Jiang, G. Li, J. Feng, and W. Li. String similarity joins: An experimental evaluation. PVLDB, 7(8):625--636, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD Conference, pages 802--803, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. C. Li, J. Lu, and Y. Lu. Efficient merging and filtering algorithms for approximate string searches. In ICDE, pages 257--266, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. Li, B. Wang, and X. Yang. Vgram: Improving performance of approximate queries on string collections using variable-length grams. In VLDB, pages 303--314, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. G. Li, D. Deng, and J. Feng. A partition-based method for string similarity joins with edit-distance constraints. ACM Trans. Database Syst., 38(2):9, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Li, D. Deng, J. Wang, and J. Feng. Pass-join: A partition-based method for similarity joins. PVLDB, 5(3):253--264, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In SIGKDD, pages 169--178, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. M. Michelson and C. A. Knoblock. Learning blocking schemes for record linkage. In AAAI, pages 440--445, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD Conference, pages 1033--1044, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. A. D. Sarma, A. Jain, A. Machanavajjhala, and P. Bohannon. An automatic blocking mechanism for large-scale de-duplication tasks. In CIKM, pages 1055--1064, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Wang, G. Li, D. Deng, Y. Zhang, and J. Feng. Two birds with one stone: An efficient hierarchical framework for top-k and threshold-based string similarity search. In ICDE, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  23. J. Wang, G. Li, and J. Feng. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. PVLDB, 3(1):1219--1230, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Wang, G. Li, and J. Feng. Fast-join: An efficient method for fuzzy token matching based string similarity join. In ICDE, pages 458--469, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Wang, G. Li, and J. Feng. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In SIGMOD Conference, pages 85--96, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. Vchunkjoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 25(8):1916--1929, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. S. E. Whang, D. Menestrina, G. Koutrika, M. Theobald, and H. Garcia-Molina. Entity resolution with iterative blocking. In SIGMOD Conference, pages 219--232, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. C. Xiao, W. Wang, and X. Lin. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, pages 131--140, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Z. Zhang, M. Hadjieleftheriou, B. C. Ooi, and D. Srivastava. Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In SIGMOD, pages 915--926, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient Similarity Join and Search on Multi-Attribute Data

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
        May 2015
        2110 pages
        ISBN:9781450327589
        DOI:10.1145/2723372

        Copyright © 2015 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 May 2015

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '15 Paper Acceptance Rate106of415submissions,26%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader