research-article

Trie-based similarity search and join

Authors:
Jianbin Qin

University of New South Wales, Australia

University of New South Wales, Australia
View Profile

,
Xiaoling Zhou

University of New South Wales, Australia

University of New South Wales, Australia
View Profile

,
Wei Wang

University of New South Wales, Australia

University of New South Wales, Australia
View Profile

,
Chuan Xiao

Nagoya University, Japan

Nagoya University, Japan
View Profile

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 WorkshopsMarch 2013Pages 392–396https://doi.org/10.1145/2457317.2457389

Published:18 March 2013Publication History

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops

Pages 392–396

ABSTRACT

Driven by the increasing demands from applications such as data cleansing, integration, and bioinformatics, approximate string matching queries have gain much attention recently. In this paper, we present the design and implementation of a trie-based system which supports both string similarity search and join based on our recent work [23].

References

R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, 2007. Google ScholarDigital Library
S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operator for similarity joins in data cleaning. In ICDE, 2006. Google ScholarDigital Library
S. Chaudhuri and R. Kaushik. Extending autocompletion to tolerate errors. In SIGMOD Conference, 2009. Google ScholarDigital Library
J. Feng, J. Wang, and G. Li. Trie-join: a trie-based method for efficient string similarity joins. VLDB J., 21(4):437--461, 2012. Google ScholarDigital Library
L. Gravano, P. G. Ipeirotis, H. V. Jagadish, N. Koudas, S. Muthukrishnan, and D. Srivastava. Approximate string joins in a database (almost) for free. In VLDB, 2001. Google ScholarDigital Library
M. Hadjieleftheriou and C. Li. Efficient approximate search on string collections. PVLDB, 2(2):1660--1661, 2009. Google ScholarDigital Library
M. Hadjieleftheriou and D. Srivastava. Approximate string processing. Foundations and Trends in Databases, 2(4):267--402, 2011. Google ScholarDigital Library
T. Kahveci and A. K. Singh. Efficient index structures for string databases. In VLDB, pages 351--360, 2001. Google ScholarDigital Library
G. Li, S. Ji, C. Li, and J. Feng. Efficient fuzzy full-text type-ahead search. The VLDB Journal, 20(4):617--640, 2011. Google ScholarDigital Library
X. Lin and W. Wang. Set and string similarity queries: A survey. Chinese Journal of Computers, (10):1853--1862, 2011.Google Scholar
W. J. Masek and M. Paterson. A faster algorithm computing string edit distances. J. Comput. Syst. Sci., 20(1):18--31, 1980.Google ScholarCross Ref
J. Qin, W. Wang, Y. Lu, C. Xiao, and X. Lin. Efficient exact edit similarity query processing with the asymmetric signature scheme. In SIGMOD, 2011. Google ScholarDigital Library
D. Sokol, G. Benson, and J. Tojeira. Tandem repeats over the edit distance. Bioinformatics, 23(2):30--35, 2007. Google ScholarDigital Library
B. S. T. Bocek, E. Hunt. Fast Similarity Search in Large Dictionaries. Technical Report ifi-2007.02, Department of Informatics, University of Zurich, April 2007. http://fastss.csg.uzh.ch/.Google Scholar
E. Ukkonen. Algorithms for approximate string matching. Information and Control, 64(1-3):100--118, 1985. Google ScholarDigital Library
J. Wang, J. Feng, and G. Li. Trie-join: Efficient trie-based string similarity joins with edit. In VLDB, 2010. Google ScholarDigital Library
W. Wang. Similarity joins as stronger metric operations. SIGSPATIAL Special, 2:24--27, July 2010. Google ScholarDigital Library
W. Wang, J. Qin, C. Xiao, X. Lin, and H. T. Shen. VChunkJoin: An efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng., 2012.Google Scholar
W. Wang, C. Xiao, X. Lin, and C. Zhang. Efficient approximate entity extraction with edit constraints. In SIMGOD, 2009. Google ScholarDigital Library
C. Xiao, J. Qin, W. Wang, Y. Ishikawa, K. Tsuda, and K. Sadakane. Efficient error-tolerant query autocompletion. Proceedings of the VLDB Endowment, 2013.Google ScholarDigital Library
C. Xiao, W. Wang, and X. Lin. Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB, 1(1):933--944, 2008. Google ScholarDigital Library
C. Xiao, W. Wang, X. Lin, and J. X. Yu. Efficient similarity joins for near duplicate detection. In WWW, 2008. Google ScholarDigital Library
X. Zhou, J. Qin, C. Xiao, W. Wang, X. Lin, and Y. Ishikawa. LEVA: An efficient query processing algorithm for error tolerant autocompletion. Submitted for publication.Google Scholar

Index Terms

Trie-based similarity search and join
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        Cluster analysis
2. Information systems
  1. Information retrieval
  2. Information systems applications

Recommendations

Trie-join: efficient trie-based string similarity joins with edit-distance constraints

A string similarity join finds similar pairs between two collections of strings. It is an essential operation in many applications, such as data integration and cleaning, and has attracted significant attention recently. In this paper, we study string ...
Read More
Trie-join: a trie-based method for efficient string similarity joins

A string similarity join finds similar pairs between two collections of strings. Many applications, e.g., data integration and cleaning, can significantly benefit from an efficient string-similarity-join algorithm. In this paper, we study string ...
Read More
String similarity search and join: a survey

String similarity search and join are two important operations in data cleaning and integration, which extend traditional exact search and exact join operations in databases by tolerating the errors and inconsistencies in the data. They have many real-...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops
March 2013
423 pages
ISBN:9781450315999
DOI:10.1145/2457317
General Chair:
Giovanna Guerrini
Università di Genova, Italy
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 March 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
EDBT '13 Paper Acceptance Rate7of10submissions,70%Overall Acceptance Rate7of10submissions,70%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 2
  Total Citations
  View Citations
- 203
  Total Downloads
- Downloads (Last 12 months)10
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Trie-based similarity search and join

EDBT '13: Proceedings of the Joint EDBT/ICDT 2013 Workshops

ABSTRACT

References

Cited By

Index Terms

Recommendations

Trie-join: efficient trie-based string similarity joins with edit-distance constraints

Trie-join: a trie-based method for efficient string similarity joins

String similarity search and join: a survey