skip to main content
10.1145/1183614.1183723acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
Article

A dictionary for approximate string search and longest prefix search

Published:06 November 2006Publication History

ABSTRACT

In this paper we propose a dictionary data structure for string search with errors where the query string may didiffer from the expected matching string by a few edits. This data structure can also be used to find the database string with the longest common prefix with few errors. Specifically, with a database of n random strings, each of length of O(m), we show how to perform string search on a query string that differs from its closest match by k edits using a data structure of linear size and query time equal to Õ(log n 2 log n klog a 2m over 2m). This means that if k < m over log a 2m log n, then the query time is Õ(1). This is of significant in practice as there are several applications where k is small relative to m. Our approach converts strings into bit vectors so that similar strings can map to similar bit vectors with small hamming distance. A simple reduction can be used to obtain similar results for approximate longest prefix search.

References

  1. Alexandr Andoni and Piotr Indyk. Efficient algorithms for substring near neighbor problem. In Proceedings of the 17th annual ACM-SIAM symposium on Discrete algorithm, pages 1203--1212, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Ziv Bar-Yossef, T. S. Jayram, Robert Krauthgamer, and Ravi Kumar. Approximating edit distance efficiently. In Proc. 45th Symposium on Foundations of Computer Science (FOCS 2004), pages 550--559, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Burton H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey, 2002.Google ScholarGoogle Scholar
  5. Andrei Broder. On the resemblance and containment of documents. In Proceedings of Compression and Complexity of SEQUENCES SEQS: Sequences '91, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Andrei Z. Broder, Moses Charikar, Alan M. Frieze, and Michael Mitzenmacher. Min-wise independent permutations. Journal of Computer and System Sciences, 60(3):630--659, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Richard Cole, Lee-Ad Gottlieb, and Moshe Lewenstein. Dictionary matching and indexing with errors and don't cares. In Proceedings of the 36th annual ACM symposium on Theory of computing (STOC), pages 91--100, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Cristian Estan and George Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3):270--313, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Piotr Indyk and Rajeev Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of 30th ACM Symposium on Theory of Computing (STOC), pages 604--613, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Adam Kirsch and Michael Mitzenmacher. Distance-sensitive bloom filters. In Proceedings of ALENEX 06, 2006.Google ScholarGoogle Scholar
  11. Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. Efficient search for approximate nearest neighbor in high dimensional spaces. In Proc. of 30th ACM Symposium on Theory of Computing (STOC), pages 614--623, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Gad M. Landau and Uzi Vishkin. Efficient string matching with k mismatches. Theoretical Computer Science, 43:239--249, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. M. D. McIllroy. Development of a spelling list. IEEE Transactions on Communications, 30(1):91--99, 1982.Google ScholarGoogle ScholarCross RefCross Ref
  14. James K. Mullin and Daniel J. Margoliash. A tale of three spelling checkers. Software - Practice and Experience, 20(6):625--630, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Gonzalo Navarro. A guided tour to approximate string matching. ACM Computing Surveys, 33(1):31--88, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Rafail Ostrovsky and Yuval Rabani. Low distortion embeddings for edit distance. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing, pages 218--224, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. H. Shang and T. H. Merrettal. Tries for approximate string matching. IEEE Transactions on Knowledge and Data Engineering, 8(4):540--547, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A scalable Peer-To-Peer lookup service for internet applications. In Proceedings of the 2001 ACM SIGCOMM Conference, pages 149--160, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. E. Ukkonnen. Finding approximate pattern in strings. Journal of Algorithms, 6:132--137, 1985.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. A dictionary for approximate string search and longest prefix search

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      CIKM '06: Proceedings of the 15th ACM international conference on Information and knowledge management
      November 2006
      916 pages
      ISBN:1595934332
      DOI:10.1145/1183614

      Copyright © 2006 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 6 November 2006

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      Overall Acceptance Rate1,861of8,427submissions,22%

      Upcoming Conference

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader