ABSTRACT
The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents.
- R. A. Baeza-Yates and G. H. Gonnet. Fast text searching for regular expressions or automaton searching on tries. J. ACM, 43(6):915 -- 936, 1996. Google ScholarDigital Library
- M. Simanek. The factor automaton. Kybernetika, 38(1):105 -- 111, 2002.Google Scholar
- M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two strings matching algorithms. Algorithmica, 12(4/5):247 -- 267, 1994.Google ScholarCross Ref
- GNUgrep. ftp://reality.sgiweb.org/freeware/relnotes/fw-5.3/fw gnugrep/gnugrep.html.Google Scholar
- J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, Massachusetts, 1979. Google ScholarDigital Library
- L. F. Kolakowski, J. Leunissen, and J. E. Smith. Prosearch: Fast searching of protein sequences with regular expression patterns related to protein structure and function. Biotechniques, 13:919 -- 921, 1992.Google Scholar
- T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791--797, 2008. Google ScholarDigital Library
- M. Mohri. String matching with automata. Nordic Journal of Computing, 4(2):217 -- 231, 1997.Google Scholar
- C. Navarro. NR-grep: a fast and flexible pattern matching tool. Software Practice and Experience (SPE), 31:1265 -- 1312, 2001. Google ScholarDigital Library
- C. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5:4, 2000. Google ScholarDigital Library
- C. Navarro and M. Raffinot. Compact DFA representation for fast regular expression search. In Proceedings of WAE'01, Lecture Notes in Computer Science 2141, pages 1 -- 12, 2001. Google ScholarDigital Library
- C. Navarro and M. Raffinot. New techniques for regular expression searching. Algorithmica, 41(2):89 -- 116, 2004.Google ScholarCross Ref
- R. Staden. Screening protein and nucleic acid sequences against libraries of patterns. J. DNA Sequencing Mapping, 1:369 -- 374, 1991.Google Scholar
- B. W. Watson. A new regula grammar pattern matching algorithm. In Proceedings of the 4th Annual European Sysmposium, Lecture Notes in Computer Science 1136, pages 364 -- 377. Springer-Verlag, 1996. Google ScholarDigital Library
- S. Wu and U. Manber. Fast text searching allowing errors. Comm. of the ACM, 35(10):83 -- 91, 1992. Google ScholarDigital Library
Index Terms
Improving regular-expression matching on strings using negative factors
Recommendations
Negative Factor: Improving Regular-Expression Matching in Strings
Special Issue: Invited 2014 PODS and EDBT Revised ArticlesThe problem of finding matches of a regular expression (RE) on a string exists in many applications, such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each ...
A regular expression matching circuit: Decomposed non-deterministic realization with prefix sharing and multi-character transition
This paper shows a compact realization of regular expression matching circuits on FPGAs. First, the given regular expression is converted into a non-deterministic finite automaton (NFA) by the modified McNaughton-Yamada method. Second, to reduce the ...
Fast Pattern Matching in Strings
An algorithm is presented which finds all occurrences of one given string within another, in running time proportional to the sum of the lengths of the strings. The constant of proportionality is low enough to make this algorithm of practical use, and the ...
Comments