ABSTRACT
Regular expressions (regexes) are a powerful string manipulation tool. Unfortunately, in programming languages like Python, Java, and JavaScript, they are unnecessarily dangerous, implemented with worst-case exponential matching behavior. This high time complexity exposes software services to regular expression denial of service (ReDoS) attacks.
We propose a data-driven redesign of regex engines, to reflect how regexes are used and what they typically look like. We report that about 95% of regexes in popular programming languages can be evaluated in linear time. The regex engine is a fundamental component of a programming language, and any changes risk introducing compatibility problems. We believe a full redesign is therefore impractical, and so we describe how the vast majority of regex matches can be made linear-time with minor, not major, changes to existing algorithms. Our prototype shows that on a kernel of the regex language, we can trade space for time to make regex matches safe
- {n. d.}. re1: A simple regular expression engine, easy to read and study. https: //code.google.com/archive/p/re1.Google Scholar
- Alfred V Aho. 1990. Algorithms for finding patterns in strings. Elsevier, Chapter 5, 255–300.Google Scholar
- Carl Chapman and Kathryn T Stolee. 2016. Exploring regular expression usage and context in Python. International Symposium on Software Testing and Analysis (ISSTA) (2016). Google ScholarDigital Library
- Russ Cox. 2007. Regular Expression Matching Can Be Simple And Fast (but is slow in Java, Perl, PHP, Python, Ruby, ...). https://swtch.com/~rsc/regexp/regexp1.htmlGoogle Scholar
- Scott Crosby. 2003. Denial of service through regular expressions. USENIX Security work in progress report (2003).Google Scholar
- Scott A Crosby and Dan S Wallach. 2003. Denial of Service via Algorithmic Complexity Attacks. In USENIX Security. Google ScholarDigital Library
- James C Davis, Christy A Coghlan, Francisco Servant, and Dongyoon Lee. 2018. The Impact of Regular Expression Denial of Service (ReDoS) in Practice: an Empirical Study at the Ecosystem Scale. In The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). Google ScholarDigital Library
- James C Davis, Michael Michael, Christy A Coghlan, Francisco Servant, and Dongyoon Lee. 2019. Are Regular Expressions a Lingua Franca? An Empirical Study on the Re-use and Portability of Regular Expressions. In The ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). Google ScholarDigital Library
- James C Davis, Eric R Williamson, and Dongyoon Lee. 2018. A Sense of Time for JavaScript and Node.js: First-Class Timeouts as a Cure for Event Handler Poisoning. In USENIX Security Symposium (USENIX Security). Google ScholarDigital Library
- A Ojamaa and K Duuna. 2012. Assessing the security of Node.js platform. In 7th International Conference for Internet Technology and Secured Transactions (ICITST).Google Scholar
- Asiri Rathnayake and Hayo Thielecke. 2014. Static Analysis for Regular Expression Exponential Runtime via Substructural Logics. Technical Report.Google Scholar
- Alex Roichman and Adar Weidman. 2009. VAC - ReDoS: Regular Expression Denial Of Service. Open Web Application Security Project (OWASP) (2009).Google Scholar
- Michael Sipser. 2006. Introduction to the Theory of Computation. Vol. 2. Thomson Course Technology Boston. Google ScholarDigital Library
- Henry Spencer. 1994. A regular-expression matcher. In Software solutions in C. 35–71. Google ScholarDigital Library
- Cristian-Alexandru Staicu and Michael Pradel. 2018. Freezing the Web: A Study of ReDoS Vulnerabilities in JavaScript-based Web Servers. In USENIX Security Symposium (USENIX Security). https://www.npmjs.com/package/safe-regexhttp: //mp.binaervarianz.de/ReDoS_TR_Dec2017.pdf Google ScholarDigital Library
- Bryan Sullivan. 2010. New Tool: SDL Regex Fuzzer. https://blogs.microsoft.com/ microsoftsecure/2010/10/12/new-tool-sdl-regex-fuzzer/Google Scholar
- Ken Thompson. 1968. Regular Expression Search Algorithm. Communications of the ACM (CACM) (1968). Google ScholarDigital Library
- Peipei Wang and Kathryn T Stolee. 2018. How well are regular expressions tested in the wild?. In Foundations of Software Engineering (FSE). Google ScholarDigital Library
- Nicolaas Hendrik Weideman. 2017. Static Analysis of Regular Expressions. Ph.D. Dissertation. Stellenbosch University.Google Scholar
- Wikipedia contributors. 2018. Regular expression — Wikipedia, The Free Encyclopedia. https://web.archive.org/web/20180920152821/https://en.wikipedia.org/ w/index.php?title=Regular_expression.Google Scholar
- Valentin Wustholz, Oswaldo Olivo, Marijn J H Heule, and Isil Dillig. 2017. Static Detection of DoS Vulnerabilities in Programs that use Regular Expressions. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems (TACAS). Abstract 1 Introduction 2 Background and Related Work 3 Empirical Motivation 4 Regex Engine Design 5 Results 6 Conclusions References Google ScholarDigital Library
Index Terms
- Rethinking Regex engines to address ReDoS
Recommendations
Regex matching with counting-set automata
We propose a solution to the problem of efficient matching regular expressions (regexes) with bounded repetition, such as (ab){1,100}, using deterministic automata. For this, we introduce novel counting-set automata (CsAs), automata with registers that ...
Why aren’t regular expressions a lingua franca? an empirical study on the re-use and portability of regular expressions
ESEC/FSE 2019: Proceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software EngineeringThis paper explores the extent to which regular expressions (regexes) are portable across programming languages. Many languages offer similar regex syntaxes, and it would be natural to assume that regexes can be ported across language boundaries. But ...
The impact of regular expression denial of service (ReDoS) in practice: an empirical study at the ecosystem scale
ESEC/FSE 2018: Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software EngineeringRegular expressions (regexes) are a popular and powerful means of automatically manipulating text. Regexes are also an understudied denial of service vector (ReDoS). If a regex has super-linear worst-case complexity, an attacker may be able to trigger ...
Comments