skip to main content
article

Estimating the selectivity of approximate string queries

Published: 01 June 2007 Publication History

Abstract

Approximate queries on string data are important due to the prevalence of such data in databases and various conventions and errors in string data. We present the VSol estimator, a novel technique for estimating the selectivity of approximate string queries. The VSol estimator is based on inverse strings and makes the performance of the selectivity estimator independent of the number of strings. To get inverse strings we decompose all database strings into overlapping substrings of length q (q-grams) and then associate each q-gram with its inverse string: the IDs of all strings that contain the q-gram. We use signatures to compress inverse strings, and clustering to group similar signatures.
We study our technique analytically and experimentally. The space complexity of our estimator only depends on the number of neighborhoods in the database and the desired estimation error. The time to estimate the selectivity is independent of the number of database strings and linear with respect to the length of query string. We give a detailed empirical performance evaluation of our solution for synthetic and real-world datasets. We show that VSol is effective for large skewed databases of short strings.

References

[1]
Blohsfeld, B., Korus, D., and Seeger, B. 1999. A comparison of selectivity estimators for range queries on metric attributes. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 239--250.
[2]
Broder, A. Z. 1998. On the resemblance and containment of documents. In Proceedings of the (SEQS'91).
[3]
Broder, A. Z. 2000. Identifying and filtering near-duplicate documents. In Proceedings of the 11th Anual Symposium on Combinatorial Pattern Matching. 1--10.
[4]
Chaudhuri, S., Ganti, V., and Gravano, L. 2004. Selectivity estimation for string predicates: Overcoming the underestimation problem. In Proceedings of the International Conference on Data Engineering (ICDE). 227--239.
[5]
Chen, Z., Korn, F., Koudas, N., and Muthukrishnan, S. 2003. Generalized substring selectivity estimation. J. Comput. Syst. Sci. 66, 1, 98--132.
[6]
Cohen, E. 1994. Estimating the size of the transitive closure in linear time. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS) (Nov.). 190--200.
[7]
Cohen, E., D., M., Fujiwara, S., Gionis, A., Indyk, P., Motwani, R., Ullman, J. D., and Yang, C. 2000. Finding interesting associations without support pruning. In Proceedings of the International Conference on Data Engineering (ICDE). 489--499.
[8]
Cormode, G. and Muthukrishnan, S. 2005. An improved data stream summary: The count-min sketch and its applications. J. Alg. 55, 1, 58--75.
[9]
Frakes, B. and Yates, R. 1992. Information Retrieval Data Structures and Algorithms. Prentice-Hall, Upper Saddle River, NJ.
[10]
Gravano, L., Ipeirotis, P. G., Jagadish, H. V., Koudas, N., Muthukrishnan, S., and Srivastava, D. 2001. Approximate string joins in a database (almost) for free. In Proceedings of the International Conference on Very Large Databases (VLDB). 491--500.
[11]
Hodge, V. J. and Austin, J. 2003. A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Trans. Knowl. Data Eng. 15, 5, 1073--1081.
[12]
Jagadish, H. V., Kapitskaia, O., Ng, R. T., and Srivastava, D. 2000. One-Dimensional and multi-dimensional substring selectivity estimation. VLDB J. 9, 3, 214--230.
[13]
Jagadish, H. V., Koudas, N., Muthukrishnan, S., Poosala, V., Sevcik, K. C., and Suel, T. 1998. Optimal histograms with quality guarantees. In Proceedings of the International Conference on Very Large Databases (VLDB). 275--286.
[14]
Jain, A. and Dubes, R. 1988. Algorithms for Clustering Data. Prentice-Hall, Upper Saddle River, NJ.
[15]
Jain, A. K., Murty, M. N., and Flynn, P. J. 1999. Data clustering: A review. ACM Comput. Surv. 31, 3, 264--323.
[16]
Jin, L., Koudas, N., Li, C., and Tung, A. K. H. 2005. Indexing mixed types for approximate retrieval. In Proceedings of the International Conference on Very Large Databases (VLDB). 793--804.
[17]
Jin, L. and Li, C. 2005. Selectivity estimation for fuzzy string predicates in large data sets. In Proceedings of the International Conference on Very Large Databases (VLDB). 397--408.
[18]
Jin, L., Li, C., and Mehrotra, S. 2003. Efficient record linkage in large data sets. In Proceedings of the International Conference on Database Systems for Advanced Applications (DASFAA). 137.
[19]
Krishnan, P., Vitter, J. S., and Iyer, B. 1996. Estimating alphanumeric selectivity in the presence of wildcards. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 282--293.
[20]
Kukich, K. 1992. Technique for automatically correcting words in text. ACM Comput. Surv. 24, 4, 377--439.
[21]
Matias, Y., Vitter, J. S., and Wang, M. 1998. Wavelet-based histograms for selectivity estimation. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 448--459.
[22]
Matias, Y., Vitter, J. S., and Wang, M. 2000. Dynamic maintenance of wavelet-based histograms. VLDB J. 101--110.
[23]
Mazeika, A. and Böhlen, M. H. 2006. Cleansing databases of misspelled proper nouns. In Proceedings of the CleanDB Workshop (in conjunction with) the International Conference on Very Large Databases (VLDB).
[24]
Muralikrishna, M. and DeWitt, D. J. 1988. Equi-Depth histograms for estimating selectivity factors for multi-dimensional queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 28--36.
[25]
Navarro, G. 2001. A guided tour to approximate string matching. ACM Comput. Surv. 33, 1, 31--88.
[26]
Poosala, V., Haas, P. J., Ioannidis, Y. E., and Shekita, E. J. 1996. Improved histograms for selectivity estimation of range predicates. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD). 294--305.
[27]
Sahinalp, C., Tasan, M., Macker, J., and Ozsoyoglu, Z. M. 2003. Distance based indexing for string proximity search. In Proceedings of the International Conference on Data Engineering (ICDE). 125--137.
[28]
Salton, G. and McGill, M. J. 1986. Introduction to Modern Information Retrieval. McGraw-Hill, New York.
[29]
Sheu, S., Cheng, C.-Y., and Chang, A. 2005. Fast pattern detection in stream data. In Proceedings of the Conference on Advanced Information Networking and Applications (AINA). 125--130.
[30]
Ukkonen, E. 1983. On approximate string matching. In Proceedings of the Conference on Foundations of Computation Theory (FCT).
[31]
Vernica, R. and Li, C. 2007. Flamingo project. http://www.ics.uci.edu/~flamingo/.

Cited By

View all
  • (2025)Cardinality Estimation of LIKE Predicate Queries using Deep LearningProceedings of the ACM on Management of Data10.1145/37096703:1(1-26)Online publication date: 11-Feb-2025
  • (2023)Context Extraction in Unsupervised Entity Resolution2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE)10.1109/CSCE60160.2023.00304(1842-1848)Online publication date: 24-Jul-2023
  • (2021)AstridProceedings of the VLDB Endowment10.14778/3436905.343690714:4(471-484)Online publication date: 22-Feb-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Transactions on Database Systems
ACM Transactions on Database Systems  Volume 32, Issue 2
June 2007
267 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/1242524
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2007
Published in TODS Volume 32, Issue 2

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Inverse strings
  2. min-wise hash signatures
  3. q-grams

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)7
  • Downloads (Last 6 weeks)1
Reflects downloads up to 12 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Cardinality Estimation of LIKE Predicate Queries using Deep LearningProceedings of the ACM on Management of Data10.1145/37096703:1(1-26)Online publication date: 11-Feb-2025
  • (2023)Context Extraction in Unsupervised Entity Resolution2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE)10.1109/CSCE60160.2023.00304(1842-1848)Online publication date: 24-Jul-2023
  • (2021)AstridProceedings of the VLDB Endowment10.14778/3436905.343690714:4(471-484)Online publication date: 22-Feb-2021
  • (2020)QuickSel: Quick Selectivity Learning with Mixture ModelsProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389727(1017-1033)Online publication date: 11-Jun-2020
  • (2020)Monotonic Cardinality Estimation of Similarity Selection: A Deep Learning ApproachProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3380570(1197-1212)Online publication date: 11-Jun-2020
  • (2020)Using Positional Sequence Patterns to Estimate the Selectivity of SQL LIKE QueriesExpert Systems with Applications10.1016/j.eswa.2020.113762(113762)Online publication date: Jul-2020
  • (2019)A Hierarchical Index Structure for Region-Aware Spatial Keyword Search with Edit Distance ConstraintDatabase Systems for Advanced Applications10.1007/978-3-030-18579-4_35(591-608)Online publication date: 22-Apr-2019
  • (2018)Keyword Search Mechanisms in Geo-Spatial DatabasesEmerging Trends in Open Source Geographic Information Systems10.4018/978-1-5225-5039-6.ch007(176-194)Online publication date: 2018
  • (2016)Hobbes3: Dynamic generation of variable-length signatures for efficient approximate subsequence mappings2016 IEEE 32nd International Conference on Data Engineering (ICDE)10.1109/ICDE.2016.7498238(169-180)Online publication date: May-2016
  • (2015)GFilter: A General Gram Filter for String Similarity SearchIEEE Transactions on Knowledge and Data Engineering10.1109/TKDE.2014.234991427:4(1005-1018)Online publication date: 1-Apr-2015
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media