Abstract
As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which find similar string pairs from two large sets of strings whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, and there is no algorithm that can efficiently and adaptively support both short strings and long strings. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then for each string, we select some of its substrings, identify the selected substrings from the inverted indices, and take strings on the inverted lists of the found substrings as candidates of this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method can minimize the number of selected substrings. We develop novel pruning techniques to efficiently verify the candidates. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.
Supplemental Material
Available for Download
Supplemental movie, appendix, image and software files for, A partition-based method for string similarity joins with edit-distance constraints
- Agrawal, S., Chakrabarti, K., Chaudhuri, S., and Ganti, V. 2008. Scalable ad-hoc entity extraction from text collections. Proc. VLDB Endow. 1, 1, 945--957. Google ScholarDigital Library
- Arasu, A., Ganti, V., and Kaushik, R. 2006. Efficient exact set-similarity joins. In Proceedings of the International Conference on Very Large Databases. 918--929. Google ScholarDigital Library
- Bayardo, R. J., Ma, Y., and Srikant, R. 2007. Scaling up all pairs similarity search. In Proceedings of the International World Wide Web Conference. 131--140. Google ScholarDigital Library
- Behm, A., Ji, S., Li, C., and Lu, J. 2009. Space-constrained gram-based indexing for efficient approximate string search. In Proceedings of the International Conference on Data Engineering. 604--615. Google ScholarDigital Library
- Behm, A., Li, C., and Carey, M. J. 2011. Answering approximate string queries on large data sets using external memory. In Proceedings of the International Conference on Data Engineering. 888--899. Google ScholarDigital Library
- Chakrabarti, K., Chaudhuri, S., Ganti, V., and Xin, D. 2008. An efficient filter for approximate membership checking. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 805--818. Google ScholarDigital Library
- Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 313--324. Google ScholarDigital Library
- Chaudhuri, S., Ganti, V., and Kaushik, R. 2006. A primitive operator for similarity joins in data cleaning. In Proceedings of the International Conference on Data Engineering. 5--16. Google ScholarDigital Library
- Deng, D., Li, G., and Feng, J. 2012. An efficient trie-based method for approximate entity extraction with edit-distance constraints. In Proceedings of the International Conference on Data Engineering. 141--152. Google ScholarDigital Library
- Deng, D., Li, G., Feng, J., and Li, W.-S. 2013. Top-k string similarity search with edit-distance constraints. In Proceedings of the International Conference on Data Engineering.Google Scholar
- Feng, J., Wang, J., and Li, G. 2012. Trie-join: a trie-based method for efficient string similarity joins. VLDB J. 21, 4, 437--461. Google ScholarDigital Library
- Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the International Conference on Very Large Databases. 491--500. Google ScholarDigital Library
- Hadjieleftheriou, M., Chandel, A., Koudas, N., and Srivastava, D. 2008a. Fast indexes and algorithms for set similarity selection queries. In Proceedings of the International Conference on Data Engineering. 267--276. Google ScholarDigital Library
- Hadjieleftheriou, M., Koudas, N., and Srivastava, D. 2009. Incremental maintenance of length normalized indexes for approximate string matching. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 429--440. Google ScholarDigital Library
- Hadjieleftheriou, M. and Li, C. 2009. Efficient approximate search on string collections. Proc. VLDB Endow. 2, 2, 1660--1661. Google ScholarDigital Library
- Hadjieleftheriou, M., Yu, X., Koudas, N., and Srivastava, D. 2008b. Hashed samples: selectivity estimators for set similarity selection queries. Proc. VLDB Endow. 1, 1, 201--212. Google ScholarDigital Library
- Jacox, E. H. and Samet, H. 2008. Metric space similarity joins. ACM Trans. Datab. Syst. 33, 2. Google ScholarDigital Library
- Jestes, J., Li, F., Yan, Z., and Yi, K. 2010. Probabilistic string similarity joins. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 327--338. Google ScholarDigital Library
- Jin, L., Li, C., and Vernica, R. 2008. Sepia: estimating selectivities of approximate string predicates in large databases. VLDB J. 17, 5, 1213--1229. Google ScholarDigital Library
- Lee, H., Ng, R. T., and Shim, K. 2007. Extending q-grams to estimate selectivity of string matching with low edit distance. In Proceedings of the International Conference on Very Large Databases. 195--206. Google ScholarDigital Library
- Lee, H., Ng, R. T., and Shim, K. 2009. Power-law based estimation of set similarity join size. Proc. VLDB Endow. 2, 1, 658--669. Google ScholarDigital Library
- Lee, H., Ng, R. T., and Shim, K. 2011. Similarity join size estimation using locality sensitive hashing. Proc. VLDB Endow. 4, 6, 338--349. Google ScholarDigital Library
- Li, C., Lu, J., and Lu, Y. 2008. Efficient merging and filtering algorithms for approximate string searches. In Proceedings of the International Conference on Data Engineering. 257--266. Google ScholarDigital Library
- Li, G., Deng, D., and Feng, J. 2011a. Faerie: efficient filtering algorithms for approximate dictionary-based entity extraction. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 529--540. Google ScholarDigital Library
- Li, G., Deng, D., Wang, J., and Feng, J. 2011b. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endow. 5, 3, 253--264. Google ScholarDigital Library
- Li, G., Feng, J., and Li, C. 2013. Supporting search-as-you-type using sql in databases. IEEE Trans. Knowl. Data Eng. 25, 2, 461--475. Google ScholarDigital Library
- Li, G., Ji, S., Li, C., and Feng, J. 2011c. Efficient fuzzy full-text type-ahead search. VLDB J. 20, 4, 617--640. Google ScholarDigital Library
- Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88. Google ScholarDigital Library
- Qin, J., Wang, W., Lu, Y., Xiao, C., and Lin, X. 2011. Efficient exact edit similarity query processing with the asymmetric signature scheme. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1033--1044. Google ScholarDigital Library
- Sarawagi, S. and Kirpal, A. 2004. Efficient set joins on similarity predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 743--754. Google ScholarDigital Library
- Silva, Y. N., Aref, W. G., and Ali, M. H. 2010. The similarity join database operator. In Proceedings of the International Conference on Data Engineering. 892--903.Google Scholar
- Sun, C. and Naughton, J. F. 2011. The token distribution filter for approximate string membership. In Proceedings of the International Workshop on Web and Databases.Google Scholar
- Ukkonen, E. 1985. Algorithms for approximate string matching. Inf. Control 64, 1-3, 100--118. Google ScholarDigital Library
- Vernica, R., Carey, M. J., and Li, C. 2010. Efficient parallel set-similarity joins using mapreduce. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 495--506. Google ScholarDigital Library
- Wang, J., Li, G., and Feng, J. 2010. Trie-join: Efficient trie-based string similarity joins with edit-distance constraints. Proc. VLDB Endow. 3, 1, 1219--1230. Google ScholarDigital Library
- Wang, J., Li, G., and Feng, J. 2011. Fast-join: An efficient method for fuzzy token matching based string similarity join. In Proceedings of the International Conference on Data Engineering. 458--469. Google ScholarDigital Library
- Wang, J., Li, G., and Feng, J. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 85--96. Google ScholarDigital Library
- Wang, W., Xiao, C., Lin, X., and Zhang, C. 2009. Efficient approximate entity extraction with edit distance constraints. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 759--770. Google ScholarDigital Library
- Xiao, C., Wang, W., and Lin, X. 2008a. Ed-join: an efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. 1, 1, 933--944. Google ScholarDigital Library
- Xiao, C., Wang, W., Lin, X., and Shang, H. 2009. Top-k set similarity joins. In Proceedings of the International Conference on Data Engineering. 916--927. Google ScholarDigital Library
- Xiao, C., Wang, W., Lin, X., and Yu, J. X. 2008b. Efficient similarity joins for near duplicate detection. In Proceedings of the International World Wide Web Conference. 131--140. Google ScholarDigital Library
- Yang, X., Wang, B., and Li, C. 2008. Cost-based variable-length-gram selection for string collections to support approximate queries efficiently. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 353--364. Google ScholarDigital Library
- Zhang, Z., Hadjieleftheriou, M., Ooi, B. C., and Srivastava, D. 2010. Bed-tree: An all-purpose index structure for string similarity search based on edit distance. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 915--926. Google ScholarDigital Library
Index Terms
- A partition-based method for string similarity joins with edit-distance constraints
Recommendations
The string edit distance matching problem with moves
The edit distance between two strings S and R is defined to be the minimum number of character inserts, deletes, and changes needed to convert R to S. Given a text string t of length n, and a pattern string p of length m, informally, the string edit ...
Edit distance for a run-length-encoded string and an uncompressed string
We propose a new algorithm for computing the edit distance of an uncompressed string against a run-length-encoded string. For an uncompressed string of length n and a compressed string with M runs, the algorithm computes their edit distance in time O(Mn)...
An effective candidate generation method for improving performance of edit similarity query processing
In this paper, we study edit similarity query processing to find strings similar to a query string from a collection of strings. To solve the problem, many algorithms have been proposed under a filter-and-verification framework, where candidate strings ...
Comments