ABSTRACT
In this paper we propose a dictionary data structure for string search with errors where the query string may didiffer from the expected matching string by a few edits. This data structure can also be used to find the database string with the longest common prefix with few errors. Specifically, with a database of n random strings, each of length of O(m), we show how to perform string search on a query string that differs from its closest match by k edits using a data structure of linear size and query time equal to Õ(log n 2 log n klog a 2m over 2m). This means that if k < m over log a 2m log n, then the query time is Õ(1). This is of significant in practice as there are several applications where k is small relative to m. Our approach converts strings into bit vectors so that similar strings can map to similar bit vectors with small hamming distance. A simple reduction can be used to obtain similar results for approximate longest prefix search.
- Alexandr Andoni and Piotr Indyk. Efficient algorithms for substring near neighbor problem. In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithm, pages 1203--1212, 2006. Google ScholarDigital Library
- Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Approximating edit distance efficiently. In Proc. 45th Symposium on Foundations of Computer Science (FOCS 2004), pages 550--559, 2004. Google ScholarDigital Library
- Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarDigital Library
- A. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey, 2002.Google Scholar
- Andrei Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES SEQS: Sequences '91, 1998. Google ScholarDigital Library
- Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000. Google ScholarDigital Library
- Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th annual ACM symposium on Theory of computing (STOC), pages 91--100, 2004. Google ScholarDigital Library
- Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270--313, 2003. Google ScholarDigital Library
- Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of 30th ACM Symposium on Theory of Computing (STOC), pages 604--613, 1998. Google ScholarDigital Library
- Adam Kirsch and Michael Mitzenmacher. Distance-sensitive bloom filters. In Proceedings of ALENEX 06, 2006.Google Scholar
- Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proc. of 30th ACM Symposium on Theory of Computing (STOC), pages 614--623, 1998. Google ScholarDigital Library
- Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239--249, 1986. Google ScholarDigital Library
- M. D. McIllroy. Development of a spelling list. IEEE Transactions on Communications, 30(1):91--99, 1982.Google ScholarCross Ref
- James K. Mullin and Daniel J. Margoliash. A tale of three spelling checkers. Software - Practice and Experience, 20(6):625--630, 1990. Google ScholarDigital Library
- Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001. Google ScholarDigital Library
- Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, pages 218--224, 2005. Google ScholarDigital Library
- H. Shang and T. H. Merrettal. Tries for approximate string matching. IEEE Transactions on Knowledge and Data Engineering, 8(4):540--547, 1996. Google ScholarDigital Library
- Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable Peer-To-Peer lookup service for internet applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149--160, 2001. Google ScholarDigital Library
- E. Ukkonnen. Finding approximate pattern in strings. Journal of Algorithms, 6:132--137, 1985.Google ScholarCross Ref
Index Terms
A dictionary for approximate string search and longest prefix search
Recommendations
String inference from longest-common-prefix array
AbstractThe suffix array, perhaps the most important data structure in modern string processing, is often augmented with the longest common prefix (LCP) array which stores the lengths of the longest common prefixes for lexicographically ...
Inferring an indeterminate string from a prefix graph
An indeterminate string (or, more simply, just a string) x = x 1 . . n ] on an alphabet Σ is a sequence of nonempty subsets of Σ. We say that x i 1 ] and x i 2 ] match (written x i 1 ] x i 2 ] ) if and only if x i 1 ] x i 2 ] . A feasible array is an ...
Finding a longest common subsequence between a run-length-encoded string and an uncompressed string
In this paper, we propose an O(min{mN,Mn}) time algorithm for finding a longest common subsequence of strings X and Y with lengths M and N, respectively, and run-length-encoded lengths m and n, respectively. We propose a new recursive formula for ...
Comments