research-article

A partition-based method for string similarity joins with edit-distance constraints

Authors:
Guoliang Li

Tsinghua University, China

Tsinghua University, China
View Profile

,
Dong Deng

Tsinghua University, China

Tsinghua University, China
View Profile

,
Jianhua Feng

Tsinghua University, China

Tsinghua University, China
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 38 Issue 2Article No.: 9pp 1–33https://doi.org/10.1145/2487259.2487261

Published:04 July 2013Publication History

ACM Transactions on Database Systems

Abstract

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidates. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.

Supplemental Material

Available for Download

zip

li.zip (158.8 KB)

Supplemental movie, appendix, image and software files for, A partition-based method for string similarity joins with edit-distance constraints

References

Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. 2008. Scalable ad-hoc entity extraction from text collections. Proc. VLDB Endow. 1, 1, 945--957. Google ScholarDigital Library
Arasu, A., Ganti, V., and Kaushik, R. 2006. Efficient exact set-similarity joins. In Proceedings of the International Conference on Very Large Databases. 918--929. Google ScholarDigital Library
Bayardo, R. J., Ma, Y., and Srikant, R. 2007. Scaling up all pairs similarity search. In Proceedings of the International World Wide Web Conference. 131--140. Google ScholarDigital Library
Behm, A., Ji, S., Li, C., and Lu, J. 2009. Space-constrained gram-based indexing for efficient approximate string search. In Proceedings of the International Conference on Data Engineering. 604--615. Google ScholarDigital Library
Behm, A., Li, C., and Carey, M. J. 2011. Answering approximate string queries on large data sets using external memory. In Proceedings of the International Conference on Data Engineering. 888--899. Google ScholarDigital Library
Chakrabarti, K., Chaudhuri, S., Ganti, V., and Xin, D. 2008. An efficient filter for approximate membership checking. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 805--818. Google ScholarDigital Library
Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 313--324. Google ScholarDigital Library
Chaudhuri, S., Ganti, V., and Kaushik, R. 2006. A primitive operator for similarity joins in data cleaning. In Proceedings of the International Conference on Data Engineering. 5--16. Google ScholarDigital Library
Deng, D., Li, G., and Feng, J. 2012. An efficient trie-based method for approximate entity extraction with edit-distance constraints. In Proceedings of the International Conference on Data Engineering. 141--152. Google ScholarDigital Library
Deng, D., Li, G., Feng, J., and Li, W.-S. 2013. Top-k string similarity search with edit-distance constraints. In Proceedings of the International Conference on Data Engineering.Google Scholar
Feng, J., Wang, J., and Li, G. 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21, 4, 437--461. Google ScholarDigital Library
Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the International Conference on Very Large Databases. 491--500. Google ScholarDigital Library
Hadjieleftheriou, M., Chandel, A., Koudas, N., and Srivastava, D. 2008a. Fast indexes and algorithms for set similarity selection queries. In Proceedings of the International Conference on Data Engineering. 267--276. Google ScholarDigital Library
Hadjieleftheriou, M., Koudas, N., and Srivastava, D. 2009. Incremental maintenance of length normalized indexes for approximate string matching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 429--440. Google ScholarDigital Library
Hadjieleftheriou, M. and Li, C. 2009. Efficient approximate search on string collections. Proc. VLDB Endow. 2, 2, 1660--1661. Google ScholarDigital Library
Hadjieleftheriou, M., Yu, X., Koudas, N., and Srivastava, D. 2008b. Hashed samples: selectivity estimators for set similarity selection queries. Proc. VLDB Endow. 1, 1, 201--212. Google ScholarDigital Library
Jacox, E. H. and Samet, H. 2008. Metric space similarity joins. ACM Trans. Datab. Syst. 33, 2. Google ScholarDigital Library
Jestes, J., Li, F., Yan, Z., and Yi, K. 2010. Probabilistic string similarity joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 327--338. Google ScholarDigital Library
Jin, L., Li, C., and Vernica, R. 2008. Sepia: estimating selectivities of approximate string predicates in large databases. VLDB J. 17, 5, 1213--1229. Google ScholarDigital Library
Lee, H., Ng, R. T., and Shim, K. 2007. Extending q-grams to estimate selectivity of string matching with low edit distance. In Proceedings of the International Conference on Very Large Databases. 195--206. Google ScholarDigital Library
Lee, H., Ng, R. T., and Shim, K. 2009. Power-law based estimation of set similarity join size. Proc. VLDB Endow. 2, 1, 658--669. Google ScholarDigital Library
Lee, H., Ng, R. T., and Shim, K. 2011. Similarity join size estimation using locality sensitive hashing. Proc. VLDB Endow. 4, 6, 338--349. Google ScholarDigital Library
Li, C., Lu, J., and Lu, Y. 2008. Efficient merging and filtering algorithms for approximate string searches. In Proceedings of the International Conference on Data Engineering. 257--266. Google ScholarDigital Library
Li, G., Deng, D., and Feng, J. 2011a. Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 529--540. Google ScholarDigital Library
Li, G., Deng, D., Wang, J., and Feng, J. 2011b. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endow. 5, 3, 253--264. Google ScholarDigital Library
Li, G., Feng, J., and Li, C. 2013. Supporting search-as-you-type using sql in databases. IEEE Trans. Knowl. Data Eng. 25, 2, 461--475. Google ScholarDigital Library
Li, G., Ji, S., Li, C., and Feng, J. 2011c. Efficient fuzzy full-text type-ahead search. VLDB J. 20, 4, 617--640. Google ScholarDigital Library
Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88. Google ScholarDigital Library
Qin, J., Wang, W., Lu, Y., Xiao, C., and Lin, X. 2011. Efficient exact edit similarity query processing with the asymmetric signature scheme. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1033--1044. Google ScholarDigital Library
Sarawagi, S. and Kirpal, A. 2004. Efficient set joins on similarity predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 743--754. Google ScholarDigital Library
Silva, Y. N., Aref, W. G., and Ali, M. H. 2010. The similarity join database operator. In Proceedings of the International Conference on Data Engineering. 892--903.Google Scholar
Sun, C. and Naughton, J. F. 2011. The token distribution filter for approximate string membership. In Proceedings of the International Workshop on Web and Databases.Google Scholar
Ukkonen, E. 1985. Algorithms for approximate string matching. Inf. Control 64, 1-3, 100--118. Google ScholarDigital Library
Vernica, R., Carey, M. J., and Li, C. 2010. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 495--506. Google ScholarDigital Library
Wang, J., Li, G., and Feng, J. 2010. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. Proc. VLDB Endow. 3, 1, 1219--1230. Google ScholarDigital Library
Wang, J., Li, G., and Feng, J. 2011. Fast-join: An efficient method for fuzzy token matching based string similarity join. In Proceedings of the International Conference on Data Engineering. 458--469. Google ScholarDigital Library
Wang, J., Li, G., and Feng, J. 2012. Can we beat the prefix filtering&quest;: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 85--96. Google ScholarDigital Library
Wang, W., Xiao, C., Lin, X., and Zhang, C. 2009. Efficient approximate entity extraction with edit distance constraints. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 759--770. Google ScholarDigital Library
Xiao, C., Wang, W., and Lin, X. 2008a. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. 1, 1, 933--944. Google ScholarDigital Library
Xiao, C., Wang, W., Lin, X., and Shang, H. 2009. Top-k set similarity joins. In Proceedings of the International Conference on Data Engineering. 916--927. Google ScholarDigital Library
Xiao, C., Wang, W., Lin, X., and Yu, J. X. 2008b. Efficient similarity joins for near duplicate detection. In Proceedings of the International World Wide Web Conference. 131--140. Google ScholarDigital Library
Yang, X., Wang, B., and Li, C. 2008. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 353--364. Google ScholarDigital Library
Zhang, Z., Hadjieleftheriou, M., Ooi, B. C., and Srivastava, D. 2010. Bed-tree: An all-purpose index structure for string similarity search based on edit distance. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 915--926. Google ScholarDigital Library

Index Terms

A partition-based method for string similarity joins with edit-distance constraints
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Recommendations

The string edit distance matching problem with moves

The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes, and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit ...
Read More
Edit distance for a run-length-encoded string and an uncompressed string

We propose a new algorithm for computing the edit distance of an uncompressed string against a run-length-encoded string. For an uncompressed string of length n and a compressed string with M runs, the algorithm computes their edit distance in time O(Mn)...
Read More
An effective candidate generation method for improving performance of edit similarity query processing

In this paper, we study edit similarity query processing to find strings similar to a query string from a collection of strings. To solve the problem, many algorithms have been proposed under a filter-and-verification framework, where candidate strings ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Database Systems Volume 38, Issue 2
June 2013
245 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/2487259
Issue’s Table of Contents

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 July 2013
- Accepted: 1 February 2013
- Revised: 1 November 2012
- Received: 1 June 2012
Published in tods Volume 38, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
String similarity join
edit distance
segment filter
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 30
  Total Citations
  View Citations
- 607
  Total Downloads
- Downloads (Last 12 months)15
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

The string edit distance matching problem with moves

Edit distance for a run-length-encoded string and an uncompressed string

An effective candidate generation method for improving performance of edit similarity query processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

A partition-based method for string similarity joins with edit-distance constraints

ACM Transactions on Database Systems

Abstract

Supplemental Material

Available for Download

References

Cited By

Index Terms

Recommendations

The string edit distance matching problem with moves

Edit distance for a run-length-encoded string and an uncompressed string

An effective candidate generation method for improving performance of edit similarity query processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media