skip to main content
10.1145/2463676.2465289acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Improving regular-expression matching on strings using negative factors

Published:22 June 2013Publication History

ABSTRACT

The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular Gnu Grep tool by 11 to 74 times for text documents.

References

  1. R. A. Baeza-Yates and G. H. Gonnet. Fast text searching for regular expressions or automaton searching on tries. J. ACM, 43(6):915 -- 936, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. M. Simanek. The factor automaton. Kybernetika, 38(1):105 -- 111, 2002.Google ScholarGoogle Scholar
  3. M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two strings matching algorithms. Algorithmica, 12(4/5):247 -- 267, 1994.Google ScholarGoogle ScholarCross RefCross Ref
  4. GNUgrep. ftp://reality.sgiweb.org/freeware/relnotes/fw-5.3/fw gnugrep/gnugrep.html.Google ScholarGoogle Scholar
  5. J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Reading, Massachusetts, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. L. F. Kolakowski, J. Leunissen, and J. E. Smith. Prosearch: Fast searching of protein sequences with regular expression patterns related to protein structure and function. Biotechniques, 13:919 -- 921, 1992.Google ScholarGoogle Scholar
  7. T. W. Lam, W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu. Compressed indexing and local alignment of DNA. Bioinformatics, 24(6):791--797, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Mohri. String matching with automata. Nordic Journal of Computing, 4(2):217 -- 231, 1997.Google ScholarGoogle Scholar
  9. C. Navarro. NR-grep: a fast and flexible pattern matching tool. Software Practice and Experience (SPE), 31:1265 -- 1312, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C. Navarro and M. Raffinot. Fast and flexible string matching by combining bit-parallelism and suffix automata. ACM Journal of Experimental Algorithmics (JEA), 5:4, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. C. Navarro and M. Raffinot. Compact DFA representation for fast regular expression search. In Proceedings of WAE'01, Lecture Notes in Computer Science 2141, pages 1 -- 12, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. C. Navarro and M. Raffinot. New techniques for regular expression searching. Algorithmica, 41(2):89 -- 116, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  13. R. Staden. Screening protein and nucleic acid sequences against libraries of patterns. J. DNA Sequencing Mapping, 1:369 -- 374, 1991.Google ScholarGoogle Scholar
  14. B. W. Watson. A new regula grammar pattern matching algorithm. In Proceedings of the 4th Annual European Sysmposium, Lecture Notes in Computer Science 1136, pages 364 -- 377. Springer-Verlag, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Wu and U. Manber. Fast text searching allowing errors. Comm. of the ACM, 35(10):83 -- 91, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Improving regular-expression matching on strings using negative factors

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
          June 2013
          1322 pages
          ISBN:9781450320375
          DOI:10.1145/2463676

          Copyright © 2013 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 June 2013

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          SIGMOD '13 Paper Acceptance Rate76of372submissions,20%Overall Acceptance Rate785of4,003submissions,20%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader