ABSTRACT
Driven by the increasing demands from applications such as data cleansing, integration, and bioinformatics, approximate string matching queries have gain much attention recently. In this paper, we present the design and implementation of a trie-based system which supports both string similarity search and join based on our recent work [23].
- R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007. Google ScholarDigital Library
- S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. Google ScholarDigital Library
- S. Chaudhuri and R. Kaushik. Extending autocompletion to tolerate errors. In SIGMOD Conference, 2009. Google ScholarDigital Library
- J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4):437--461, 2012. Google ScholarDigital Library
- L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. Google ScholarDigital Library
- M. Hadjieleftheriou and C. Li. Efficient approximate search on string collections. PVLDB, 2(2):1660--1661, 2009. Google ScholarDigital Library
- M. Hadjieleftheriou and D. Srivastava. Approximate string processing. Foundations and Trends in Databases, 2(4):267--402, 2011. Google ScholarDigital Library
- T. Kahveci and A. K. Singh. Efficient index structures for string databases. In VLDB, pages 351--360, 2001. Google ScholarDigital Library
- G. Li, S. Ji, C. Li, and J. Feng. Efficient fuzzy full-text type-ahead search. The VLDB Journal, 20(4):617--640, 2011. Google ScholarDigital Library
- X. Lin and W. Wang. Set and string similarity queries: A survey. Chinese Journal of Computers, (10):1853--1862, 2011.Google Scholar
- W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18--31, 1980.Google ScholarCross Ref
- J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD, 2011. Google ScholarDigital Library
- D. Sokol, G. Benson, and J. Tojeira. Tandem repeats over the edit distance. Bioinformatics, 23(2):30--35, 2007. Google ScholarDigital Library
- B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007. http://fastss.csg.uzh.ch/.Google Scholar
- E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1-3):100--118, 1985. Google ScholarDigital Library
- J. Wang, J. Feng, and G. Li. Trie-join: Efficient trie-based string similarity joins with edit. In VLDB, 2010. Google ScholarDigital Library
- W. Wang. Similarity joins as stronger metric operations. SIGSPATIAL Special, 2:24--27, July 2010. Google ScholarDigital Library
- W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. VChunkJoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 2012.Google Scholar
- W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit constraints. In SIMGOD, 2009. Google ScholarDigital Library
- C. Xiao, J. Qin, W. Wang, Y. Ishikawa, K. Tsuda, and K. Sadakane. Efficient error-tolerant query autocompletion. Proceedings of the VLDB Endowment, 2013.Google ScholarDigital Library
- C. Xiao, W. Wang, and X. Lin. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008. Google ScholarDigital Library
- C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008. Google ScholarDigital Library
- X. Zhou, J. Qin, C. Xiao, W. Wang, X. Lin, and Y. Ishikawa. LEVA: An efficient query processing algorithm for error tolerant autocompletion. Submitted for publication.Google Scholar
Index Terms
- Trie-based similarity search and join
Recommendations
Trie-join: efficient trie-based string similarity joins with edit-distance constraints
A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string ...
Trie-join: a trie-based method for efficient string similarity joins
A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string ...
String similarity search and join: a survey
String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-...
Comments