skip to main content
article

Increased bit-parallelism for approximate and multiple string matching

Published: 31 December 2005 Publication History

Abstract

Bit-parallelism permits executing several operations simultaneously over a set of bits or numbers stored in a single computer word. This technique permits searching for the approximate occurrences of a pattern of length m in a text of length n in time O(⌈m/wn), where w is the number of bits in the computer word. Although this is asymptotically the optimal bit-parallel speedup over the basic O(mn) time algorithm, it wastes bit-parallelism's power in the common case where m is much smaller than w, since wm bits in the computer words are unused. In this paper, we explore different ways to increase the bit-parallelism when the search pattern is short. First, we show how multiple patterns can be packed into a single computer word so as to search for all them simultaneously. Instead of spending O(rn) time to search for r patterns of length mw/2, we need O(⌈rm/wn) time. Second, we show how the mechanism permits boosting the search for a single pattern of length mw/2, which can be searched for in O(⌈n/⌊w/m⌋⌉) bit-parallel steps instead of O(n). Third, we show how to extend these algorithms so that the time bounds essentially depend on k instead of m, where k is the maximum number of differences permitted. Finally, we show how the ideas can be applied to other problems such as multiple exact string matching and one-against-all computation of edit distance and longest common subsequences. Our experimental results show that the new algorithms work well in practice, obtaining significant speedups over the best existing alternatives, especially on short patterns and moderate number of differences allowed. This work fills an important gap in the field, where little work has focused on very short patterns.

References

[1]
Aho, A. and Corasick, M. 1975. Efficient string matching: an aid to bibliographic search. Communications of the ACM 18, 6, 333--340.
[2]
Allauzen, C. and Raffinot, M. 1999. Factor oracle of a set of words. Technical Report 99--11, Institut Gaspard-Monge, Université de Marne-la-Vallée.
[3]
Allison, L. and Dix, T. L. 1986. A bit-string longest common subsequence algorithm. Information Processing Letters 23, 305--310.
[4]
Baeza-Yates, R. and Gonnet, G. 1992. A new approach to text searching. Communications of the ACM 35, 10, 74--82.
[5]
Baeza-Yates, R. and Navarro, G. 1999. Faster approximate string matching. Algorithmica 23, 2, 127--158.
[6]
Crochemore, M., Iliopoulos, C. S., Pinzon, Y. J., and Reid, J. F. 2001. A fast and practical bit-vector algorithm for the longest common subsequence problem. Information Processing Letters 80, 279--285.
[7]
Fredriksson, K. 2003. Row-wise tiling for the Myers' bit-parallel dynamic programming algorithm. In Proc. 10th International Symposium on String Processing and Information Retrieval (SPIRE '03). LNCS 2857. Berlin, Germany, Springer, New York. 66--79.
[8]
Fredriksson, K. and Navarro, G. 2004. Average-optimal single and multiple approximate string matching. ACM Journal of Experimental Algorithmics (JEA). 9, 1.4.
[9]
Horspool, R. N. 1980. Practical fast searching in strings. Software Practice and Experience 10, 6, 501--506.
[10]
Hyyrö, H. 2001. Explaining and extending the bit-parallel approximate string matching algorithm of Myers. Tech. Rep. A-2001-10, Department of Computer and Information Sciences, University of Tampere, Tampere, Finland.
[11]
Hyyrö, H. 2003. A bit-vector algorithm for computing Levenshtein and Damerau edit distances. Nordic Journal of Computing 10, 1--11.
[12]
Hyyrö, H. 2004. Bit-parallel LCS-length computation revisited. In Proc. 15th Australasian Workshop on Combinatorial Algorithms (AWOCA '04). 16--27.
[13]
Hyyrö, H. and Navarro, G. 2002. Faster bit-parallel approximate string matching. In Proc. 13th Combinatorial Pattern Matching (CPM '02). LNCS 2373. Berlin, Germany, Springer, New York. 203--224.
[14]
Hyyrö, H. and Navarro, G. 2005. Bit-parallel witnesses and their applications to approximate string matching. Algorithmica 41, 3, 203--231.
[15]
Levenshtein, V. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady 10, 8, 707--710. {Original in Russian in Doklady Akademii Nauk SSSR, 163(4):845--848, 1965}.
[16]
Muth, R. and Manber, U. 1996. Approximate multiple string search. In Proc. 7th Annual Symposium on Combinatorial Pattern Matching (CPM '96). LNCS 1075. Springer-Verlag, New York. 75--86.
[17]
Myers, G. 1999. A fast bit-vector algorithm for approximate string matching based on dynamic progamming. Journal of the ACM 46, 3, 395--415.
[18]
Navarro, G. 2001. A guided tour to approximate string matching. ACM Computing Surveys 33, 1, 31--88.
[19]
Navarro, G. and Baeza-Yates, R. 1999. Very fast and simple approximate string matching. Information Processing Letters 72, 65--70.
[20]
Navarro, G. and Baeza-Yates, R. 2001. Improving an algorithm for approximate string matching. Algorithmica 30, 4, 473--502.
[21]
Navarro, G. and Raffinot, M. 2000. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA). 5, 4.
[22]
Navarro, G. and Raffinot, M. 2002. Flexible Pattern Matching in Strings---Practical on-line search algorithms for texts and biological sequences. Cambridge University Press, Cambridge, UK. ISBN 0-521-81307-7.
[23]
Sellers, P. 1980. The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms 1, 359--373.
[24]
Ukkonen, E. 1985a. Algorithms for approximate string matching. Information and Control 64, 1--3, 100--118.
[25]
Ukkonen, E. 1985b. Finding approximate patterns in strings. Journal of Algorithms 6, 132--137.
[26]
Warren, H. S. 2003. Hacker's Delight. Addison-Wesley, Boston, MA. ISBN 0-201-91465-4.
[27]
Wu, S. and Manber, U. 1992. Fast text searching allowing errors. Communications of the ACM 35, 10, 83--91.

Cited By

View all
  • (2023)Non-Local Parallel Processing and Database Settlement Using Multiple Teleportation Followed by Grover Post-SelectionEntropy10.3390/e2502037625:2(376)Online publication date: 18-Feb-2023
  • (2019)Fuzzysplit: demultiplexing and trimming sequenced DNA with a declarative languagePeerJ10.7717/peerj.71707(e7170)Online publication date: 19-Jun-2019
  • (2017)The implementation of bit-parallelism for DNA sequence alignmentJournal of Physics: Conference Series10.1088/1742-6596/835/1/012004835(012004)Online publication date: 17-May-2017
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Journal of Experimental Algorithmics
ACM Journal of Experimental Algorithmics  Volume 10, Issue
2005
291 pages
ISSN:1084-6654
EISSN:1084-6654
DOI:10.1145/1064546
Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 31 December 2005
Published in JEA Volume 10

Author Tags

  1. Approximate string matching
  2. bit-parallelism
  3. multiple string matching

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)1
Reflects downloads up to 09 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2023)Non-Local Parallel Processing and Database Settlement Using Multiple Teleportation Followed by Grover Post-SelectionEntropy10.3390/e2502037625:2(376)Online publication date: 18-Feb-2023
  • (2019)Fuzzysplit: demultiplexing and trimming sequenced DNA with a declarative languagePeerJ10.7717/peerj.71707(e7170)Online publication date: 19-Jun-2019
  • (2017)The implementation of bit-parallelism for DNA sequence alignmentJournal of Physics: Conference Series10.1088/1742-6596/835/1/012004835(012004)Online publication date: 17-May-2017
  • (2016)An efficient pruning strategy for approximate string matching over suffix treeKnowledge and Information Systems10.1007/s10115-015-0896-649:1(121-141)Online publication date: 1-Oct-2016
  • (2014)State-of-the-art in string similarity search and joinACM SIGMOD Record10.1145/2627692.262770643:1(64-76)Online publication date: 13-May-2014
  • (2014)BitPAl: a bit-parallel, general integer-scoring sequence alignment algorithmBioinformatics10.1093/bioinformatics/btu50730:22(3166-3173)Online publication date: 29-Jul-2014
  • (2013)A Two-Hashing Table Multiple String Pattern Matching AlgorithmProceedings of the 2013 10th International Conference on Information Technology: New Generations10.1109/ITNG.2013.108(696-701)Online publication date: 15-Apr-2013
  • (2013)Approximate pattern matching with k-mismatches in packed textInformation Processing Letters10.1016/j.ipl.2013.07.002113:19-21(693-697)Online publication date: 1-Sep-2013
  • (2012)New Hashing-Based Multiple String Pattern Matching AlgorithmsProceedings of the 2012 Ninth International Conference on Information Technology - New Generations10.1109/ITNG.2012.34(195-200)Online publication date: 16-Apr-2012
  • (2010)From nondeterministic suffix automaton to lazy suffix treeAlgorithms and Applications10.5555/2167962.2167970(114-129)Online publication date: 1-Jan-2010
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media