skip to main content
article
Free Access

A fast bit-vector algorithm for approximate string matching based on dynamic programming

Published:01 May 1999Publication History
Skip Abstract Section

Abstract

The approximate string matching problem is to find all locations at which a query of lengthm matches a substring of a text of length n with k-or-fewer differences. Simple and practical bit-vector algorithms have been designed for this problem, most notably the one used in agrep. These algorithms compute a bit representation of the current state-set of the k-difference automaton for the query, and asymptotically run in either O(nm/w) or O(nm log σ/w) time where w is the word size of the machine (e.g., 32 or 64 in practice), and σ is the size of the pattern alphabet. Here we present an algorithm of comparable simplicity that requires only O(nm/w) time by virtue of computing a bit representation of the relocatable dynamic programming matrix for the problem. Thus, the algorithm's performance is independent of k, and it is found to be more efficient than the previous results for many choices of k and smallm.

Moreover, because the algorithm is not dependent on k, it can be used to rapidly compute blocks of the dynamic programming matrix as in the 4-Russians algorithm of Wu et al.(1996). This gives rise to an O(kn/w) expected-time algorithm for the case where m may be arbitrarily large. In practice this new algorithm, that computes a region of the dynamic progr amming (d.p.) matrx w entries at a time using the basic algorithm as a subroutine is significantly faster than our previous 4-Russians algorithm, that computes the same region 4 or 5 entries at a time using table lookup. This performance improvement yields a code that is either superior or competitive with all existing algorithms except for some filtration algorithms that are superior when k/m is sufficiently small.

References

  1. BAEZA-YATES, R. A., AND GONNET, G.H. 1992. A new approach to text searching. Commun. A CM 35, 74-82. Google ScholarGoogle Scholar
  2. BAEZA-YATES, R. A., AND NAVARRO, G. 1996. A faster algorithm for approximate string matching. In Proceedings of the 7th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, Vol. 1075. Springer-Verlag, New York, pp. 1-23. Google ScholarGoogle Scholar
  3. BAEZA-YATES, R.A. AND NAVARRO, G. 1999. Analysis for algorithm engineering: Improving an algorithm for approximate pattern matching. Unpublished manuscript.Google ScholarGoogle Scholar
  4. CHAO, K.M., HARDISON, R.C., AND MILLER, W. 1992. Recent developments in linear-space alignment methods: A survey. J. Comput. Biol. 1, 271-291.Google ScholarGoogle Scholar
  5. CHANG, W. I., AND LAMPE, J. 1992. Theoretical and empirical comparisons of approximate string matching algorithms. In Proceedings of the 3rd Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 644. Springer-Verlag, New York, pp. 172-181. Google ScholarGoogle Scholar
  6. CHANG, W.I. AND LAWLER, E.L. 1994. Sublinear expected time approximate matching and biological applications. Algorithmica 12, 327-344.Google ScholarGoogle Scholar
  7. CHAO, K.M., HARDISON, R.C., AND MILLER, W. 1992. Recent developments in linear-space alignment methods: A survey. J. Comput. Biol. 1, 271-291.Google ScholarGoogle Scholar
  8. CHAO, K. M., PEARSON, W. R., AND MILLER, W. 1992. Aligning two sequences within a specified diagonal band. Comput. Appl. BioSciences 8, 481-487.Google ScholarGoogle Scholar
  9. COBBS, A. 1995. Fast approximate matching using suffix trees. In Proceedings of the 6th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 937. Springer-Verlag, New York, pp. 41-54.Google ScholarGoogle Scholar
  10. GALm, Z., AND PARK, K. 1990. An improved algorithm for approximate string matching. SIAM J. Comput. 19, 989-999. Google ScholarGoogle Scholar
  11. LANDAU, G. M., AND VISHKIN, U. 1988. Fast string matching with k differences. J. Comput. Syst. Sci. 37, 63-78. Google ScholarGoogle Scholar
  12. MASEK, W. J., AND PATERSON, M.S. 1980. A faster algorithm for computing string edit distances. J. Comput. Syst. Sci. 20, 18-31.Google ScholarGoogle Scholar
  13. MYERS, E.W. 1994. A sublinear algorithm for approximate keywords searching. Algorithmica 12, 345-374.Google ScholarGoogle Scholar
  14. PEVZNER, P., AND WATERMAN, M.S. 1995. Multiple filtration and approximate pattern matching. Algorithmica 13, 135-154.Google ScholarGoogle Scholar
  15. SELLERS, P.H. 1980. The theory and computations of evolutionary distances: Pattern recognition. J. Algorithms 1,359-373.Google ScholarGoogle Scholar
  16. SUTINEN, E., AND TARHIO, J. 1996. Filtration with q-samples in approximate string matching. In Proceedings of the 7th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 1075. Springer-Verlag, New York, pp. 50-63. Google ScholarGoogle Scholar
  17. UKKONEN, E. 1985. Finding approximate patterns in strings. J. Algorithms 6, 132-137.Google ScholarGoogle Scholar
  18. UKKONEN, E. 1992. Approximate string-matching with q-grams and maximal matches. Theoret. Comput. Sci. 92, 191-211. Google ScholarGoogle Scholar
  19. UKKONEN, E. 1993. Approximate string matching over suffix trees. In Proceedings of the 4th Symposium on Combinatorial Pattern Matching. Lecture Notes in Computer Science, vol. 684. Springer-Verlag, New York, pp. 228-242. Google ScholarGoogle Scholar
  20. WAGNER, R.A., AND FISCHER, M.J. 1974. The string to string correction problem. J. ACM 21, 168-173. Google ScholarGoogle Scholar
  21. Wu, S., AND MANBER, U. 1992. Fast text searching allowing errors. Commun. ACM 35, 10, 83-91. Google ScholarGoogle Scholar
  22. Wu, S., MANBER, U., AND MYERS, G. 1996. A subquadratic algorithm for approximate limited expression matching. Algorithmica 15, 50-67.Google ScholarGoogle Scholar
  23. WRIGHT, A.H. 1994. Approximate string matching using within-word parallelism. Soft. Pract. Exper. 24, 337-362. Google ScholarGoogle Scholar

Index Terms

  1. A fast bit-vector algorithm for approximate string matching based on dynamic programming

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in

      Full Access

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader