skip to main content
research-article

A partition-based method for string similarity joins with edit-distance constraints

Published:04 July 2013Publication History
Skip Abstract Section

Abstract

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidates. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.

Skip Supplemental Material Section

Supplemental Material

References

  1. Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. 2008. Scalable ad-hoc entity extraction from text collections. Proc. VLDB Endow. 1, 1, 945--957. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Arasu, A., Ganti, V., and Kaushik, R. 2006. Efficient exact set-similarity joins. In Proceedings of the International Conference on Very Large Databases. 918--929. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Bayardo, R. J., Ma, Y., and Srikant, R. 2007. Scaling up all pairs similarity search. In Proceedings of the International World Wide Web Conference. 131--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Behm, A., Ji, S., Li, C., and Lu, J. 2009. Space-constrained gram-based indexing for efficient approximate string search. In Proceedings of the International Conference on Data Engineering. 604--615. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Behm, A., Li, C., and Carey, M. J. 2011. Answering approximate string queries on large data sets using external memory. In Proceedings of the International Conference on Data Engineering. 888--899. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Chakrabarti, K., Chaudhuri, S., Ganti, V., and Xin, D. 2008. An efficient filter for approximate membership checking. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 805--818. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 313--324. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Chaudhuri, S., Ganti, V., and Kaushik, R. 2006. A primitive operator for similarity joins in data cleaning. In Proceedings of the International Conference on Data Engineering. 5--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Deng, D., Li, G., and Feng, J. 2012. An efficient trie-based method for approximate entity extraction with edit-distance constraints. In Proceedings of the International Conference on Data Engineering. 141--152. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Deng, D., Li, G., Feng, J., and Li, W.-S. 2013. Top-k string similarity search with edit-distance constraints. In Proceedings of the International Conference on Data Engineering.Google ScholarGoogle Scholar
  11. Feng, J., Wang, J., and Li, G. 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21, 4, 437--461. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the International Conference on Very Large Databases. 491--500. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Hadjieleftheriou, M., Chandel, A., Koudas, N., and Srivastava, D. 2008a. Fast indexes and algorithms for set similarity selection queries. In Proceedings of the International Conference on Data Engineering. 267--276. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Hadjieleftheriou, M., Koudas, N., and Srivastava, D. 2009. Incremental maintenance of length normalized indexes for approximate string matching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 429--440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Hadjieleftheriou, M. and Li, C. 2009. Efficient approximate search on string collections. Proc. VLDB Endow. 2, 2, 1660--1661. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Hadjieleftheriou, M., Yu, X., Koudas, N., and Srivastava, D. 2008b. Hashed samples: selectivity estimators for set similarity selection queries. Proc. VLDB Endow. 1, 1, 201--212. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Jacox, E. H. and Samet, H. 2008. Metric space similarity joins. ACM Trans. Datab. Syst. 33, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Jestes, J., Li, F., Yan, Z., and Yi, K. 2010. Probabilistic string similarity joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 327--338. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Jin, L., Li, C., and Vernica, R. 2008. Sepia: estimating selectivities of approximate string predicates in large databases. VLDB J. 17, 5, 1213--1229. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Lee, H., Ng, R. T., and Shim, K. 2007. Extending q-grams to estimate selectivity of string matching with low edit distance. In Proceedings of the International Conference on Very Large Databases. 195--206. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Lee, H., Ng, R. T., and Shim, K. 2009. Power-law based estimation of set similarity join size. Proc. VLDB Endow. 2, 1, 658--669. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Lee, H., Ng, R. T., and Shim, K. 2011. Similarity join size estimation using locality sensitive hashing. Proc. VLDB Endow. 4, 6, 338--349. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Li, C., Lu, J., and Lu, Y. 2008. Efficient merging and filtering algorithms for approximate string searches. In Proceedings of the International Conference on Data Engineering. 257--266. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Li, G., Deng, D., and Feng, J. 2011a. Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 529--540. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Li, G., Deng, D., Wang, J., and Feng, J. 2011b. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endow. 5, 3, 253--264. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Li, G., Feng, J., and Li, C. 2013. Supporting search-as-you-type using sql in databases. IEEE Trans. Knowl. Data Eng. 25, 2, 461--475. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Li, G., Ji, S., Li, C., and Feng, J. 2011c. Efficient fuzzy full-text type-ahead search. VLDB J. 20, 4, 617--640. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Qin, J., Wang, W., Lu, Y., Xiao, C., and Lin, X. 2011. Efficient exact edit similarity query processing with the asymmetric signature scheme. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1033--1044. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Sarawagi, S. and Kirpal, A. 2004. Efficient set joins on similarity predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 743--754. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Silva, Y. N., Aref, W. G., and Ali, M. H. 2010. The similarity join database operator. In Proceedings of the International Conference on Data Engineering. 892--903.Google ScholarGoogle Scholar
  32. Sun, C. and Naughton, J. F. 2011. The token distribution filter for approximate string membership. In Proceedings of the International Workshop on Web and Databases.Google ScholarGoogle Scholar
  33. Ukkonen, E. 1985. Algorithms for approximate string matching. Inf. Control 64, 1-3, 100--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Vernica, R., Carey, M. J., and Li, C. 2010. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 495--506. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Wang, J., Li, G., and Feng, J. 2010. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. Proc. VLDB Endow. 3, 1, 1219--1230. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Wang, J., Li, G., and Feng, J. 2011. Fast-join: An efficient method for fuzzy token matching based string similarity join. In Proceedings of the International Conference on Data Engineering. 458--469. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Wang, J., Li, G., and Feng, J. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 85--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Wang, W., Xiao, C., Lin, X., and Zhang, C. 2009. Efficient approximate entity extraction with edit distance constraints. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 759--770. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Xiao, C., Wang, W., and Lin, X. 2008a. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. 1, 1, 933--944. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Xiao, C., Wang, W., Lin, X., and Shang, H. 2009. Top-k set similarity joins. In Proceedings of the International Conference on Data Engineering. 916--927. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Xiao, C., Wang, W., Lin, X., and Yu, J. X. 2008b. Efficient similarity joins for near duplicate detection. In Proceedings of the International World Wide Web Conference. 131--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Yang, X., Wang, B., and Li, C. 2008. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 353--364. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Zhang, Z., Hadjieleftheriou, M., Ooi, B. C., and Srivastava, D. 2010. Bed-tree: An all-purpose index structure for string similarity search based on edit distance. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 915--926. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. A partition-based method for string similarity joins with edit-distance constraints

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      • Published in

        cover image ACM Transactions on Database Systems
        ACM Transactions on Database Systems  Volume 38, Issue 2
        June 2013
        245 pages
        ISSN:0362-5915
        EISSN:1557-4644
        DOI:10.1145/2487259
        Issue’s Table of Contents

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 4 July 2013
        • Accepted: 1 February 2013
        • Revised: 1 November 2012
        • Received: 1 June 2012
        Published in tods Volume 38, Issue 2

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article
        • Research
        • Refereed

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader