skip to main content
10.1145/2676726.2677014acmconferencesArticle/Chapter ViewAbstractPublication PagespoplConference Proceedingsconference-collections
research-article

Data-Parallel String-Manipulating Programs

Published: 14 January 2015 Publication History

Abstract

String-manipulating programs are an important class of programs with applications in malware detection, graphics, input sanitization for Web security, and large-scale HTML processing. This paper extends prior work on BEK, an expressive domain-specific language for writing string-manipulating programs, with algorithmic insights that make BEK both analyzable and data-parallel. By analyzable we mean that unlike most general purpose programming languages, many algebraic properties of a BEK program are decidable (i.e., one can check whether two programs commute or compute the inverse of a program). By data-parallel we mean that a BEK program can compute on arbitrary subsections of its input in parallel, thus exploiting parallel hardware. This latter requirement is particularly important for programs which operate on large data: without data parallelism, a programmer cannot hide the latency of reading data from various storage media (i.e., reading a terabyte of data from a modern hard drive takes about 3 hours). With a data-parallel approach, the system can split data across multiple disks and thus hide the latency of reading the data.
A BEK program is expressive: a programmer can use conditionals, switch statements, and registers--or local variables--in order to implement common string-manipulating programs. Unfortunately, this expressivity induces data dependencies, which are an obstacle to parallelism. The key contribution of this paper is an algorithm which automatically removes these data dependencies by mapping a B EK program into a intermediate format consisting of symbolic transducers, which extend classical transducers with symbolic predicates and symbolic assignments. We present a novel algorithm that we call exploration which performs symbolic loop unrolling of these transducers to obtain simplified versions of the original program. We show how these simplified versions can then be lifted to a stateless form, and from there compiled to data-parallel hardware.
To evaluate the efficacy of our approach, we demonstrate up to 8x speedups for a number of real-world, BEK programs, (e.g., HTML encoder and decoder) on data-parallel hardware. To the best of our knowledge, these are the first data parallel implementation of these programs. To validate that our approach is correct, we use an automatic testing technique to compare our generated code to the original implementations and find no semantic deviations.

Supplementary Material

MPG File (p139-sidebyside.mpg)

References

[1]
R. Alur and P. Cerny. Streaming transducers for algorithmic verification of single-pass list-processing programs. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '11. ACM, 2011.
[2]
D. Balzarotti, M. Cova, V. Felmetsger, N. Jovanovic, E. Kirda, C. Kruegel, and G. Vigna. Saner: Composing static and dynamic analysis to validate sanitization in web applications. In IEEE Oakland Security and Privacy, 2008.
[3]
C. Barrett and C. Tinelli. Satisfiability modulo theories. In E. Clarke, T. Henzinger, and H. Veith, editors, Handbook of Model Checking. Springer, 2014. (to appear).
[4]
D. Bates, A. Barth, and C. Jackson. Regular expressions considered harmful in client-side xss filters. In Proceedings of the 19th international conference on World wide web, WWW '10. ACM, 2010.
[5]
M. Benedikt, C. Ley, and G. Puppis. Automata vs. logics on data words. In Proceedings of the 24th international conference/19th annual conference on Computer Science Logic (CSL'10/EACSL'10), volume 6247 of LNCS, pages 110--124. Springer, 2010.
[6]
M. Bojańczyk, A. Muscholl, T. Schwentick, L. Segoufin, and C. David. Two-variable logic on words with data. In Proceedings of the 21st Annual IEEE Symposium on Logic in Computer Science (LICS'06), pages 7--16. IEEE, 2006.
[7]
M. Botinńan and D. Babirć. Sigma*: symbolic learning of input- output specifications. In Proceedings of the 40th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL'13, pages 443--456. ACM, 2013.
[8]
L. D'Antoni and M. Veanes. Static analysis of string encoders and decoders. In 14th International Conference on Verification, Model Checking, and Abstract Interpretation, VMCAI 2013, volume 7737 of LNCS, pages 209--228. Springer, 2013.
[9]
L. D'Antoni and M. Veanes. Minimization of symbolic automata. In Proceedings of the 41st ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL'14, pages 541--553. ACM, 2014.
[10]
L. de Moura and N. Bjørner. Z3: An efficient SMT solver. In C. Ramakrishnan and J. Rehof, editors, Tools and Algorithms for the Construction and Analysis of Systems, volume 4963 of LNCS. Springer, 2008.
[11]
L. De Moura and N. Bjørner. Satisfiability modulo theories: Introduction and applications. Commun. ACM, 54(9):69--77, 2011.
[12]
T. L. Gall and B. Jeannet. Lattice automata: A representation for languages on infinite alphabets, and some applications to verification. In SAS 2007, volume 4634 of LNCS, pages 52--68, 2007.
[13]
P. Godefroid. Compositional dynamic test generation. In Proceedings of the 34th annual ACM SIGPLAN-SIGACT symposium on Principles of programming languages, POPL'13, pages 47--54, 2007.
[14]
W. D. Hillis and G. L. Steele. Data parallel algorithms. In Commun. ACM, volume 29, pages 1170--1183, Dec 1986.
[15]
P. Hooimeijer, B. Livshits, D. Molnar, P. Saxena, and M. Veanes. Fast and precise sanitizer analysis with Bek. In USENIX Security, August 2011.
[16]
J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Addison Wesley, 1979.
[17]
M. Kaminski and N. Francez. Finite-memory automata. In 31st Annual IEEE Symposium on Foundations of Computer Science, volume 2 of FOCS'90, pages 683--688. IEEE, 1990.
[18]
A. Kiezun, V. Ganesh, P. J. Guo, P. Hooimeijer, and M. D. Ernst. HAMPI: a solver for string constraints. In International Symposium on Software Testing and Analysis, 2009.
[19]
R. E. Ladner and M. J. Fischer. Parallel prefix computation. Journal of the ACM, 27(4):831--838, 1980.
[20]
L. Libkin. Variable independence for first-order definable constraints. ACM Transactions on Computational Logic, 4(4):431--451, 2003.
[21]
D. Lindsay and E. V. Nava. Universal XSS via IE8's XSS filters. In Black Hat Europe, 2010.
[22]
Anti-Cross Site Scripting Library. Microsoft Corporation, http://msdn.microsoft.com/en-us/security/aa973814.aspx.
[23]
Bek guide. Microsoft Research,. http://rise4fun.com/bek/tutorial.
[24]
Y. Minamide. Static approximation of dynamically generated web pages. In WWW '05, pages 432--441, 2005.
[25]
T. Mytkowicz, M. Musuvathi, and W. Schulte. Data-parallel finite- state machines. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '14, pages 529--542, New York, 2014. ACM.
[26]
F. Neven, T. Schwentick, and V. Vianu. Finite state machines for strings over infinite alphabets. ACM Trans. CL, 5:403--435, 2004.
[27]
XSS Filter Evasion Cheat Sheet. OWASP. http://ha.ckers.org/xss.html.
[28]
P. Saxena, D. Akhawe, S. Hanna, S. McCamant, F. Mao, and D. Song. A symbolic execution framework for JavaScript. In IEEE Security and Privacy, 2010.
[29]
P. Saxena, D. Molnar, and B. Livshits. Scriptgard: Preventing script injection attacks in legacy web applications with automatic sanitization. Technical Report MSR-TR-2010--128, Microsoft Research, August 2010.
[30]
L. Segoufin. Automata and logics for words and trees over an infinite alphabet. In Z. Ésik, editor, EACSL Annual Conference on Logic in Computer Science, CSL, volume 4207 of LNCS, pages 41--57, 2006.
[31]
W. Thies, M. Karczmarek, and S. P. Amarasinghe. Streamit: A language for streaming applications. In Proceedings of the 11th International Conference on Compiler Construction, volume 2304 of LNCS, pages 179--196. Springer, 2002.
[32]
M. Veanes, P. Hooimeijer, B. Livshits, D. Molnar, and N. Bjorner. Symbolic finite state transducers: Algorithms and applications. In Proceedings of the 39th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '12. ACM, 2012.
[33]
M. Veanes, N. Bjørner, L. Nachmanson, and S. Bereg. Monadic decomposition. In International Conference on Computer Aided Verification, volume 8559 of LNCS, pages 628--645. Springer, 2014.
[34]
P. Wadler. Deforestation: transforming programs to eliminate trees. In Proceedings of the Second European Symposium on Programming, pages 231--248, 1988.
[35]
G. Wassermann, D. Yu, A. Chander, D. Dhurjati, H. Inamura, and Z. Su. Dynamic test input generation for web applications. In International Symposium on Software Testing and Analysis, 2008.
[36]
F. Yu, T. Bultan, and O. H. Ibarra. Relational string verification using multi-track automata. In Conference on Implementation and Application of Automata, CIAA'10, pages 290--299, 2011.
[37]
D. Zhang, Q. J. Li, R. Rabbah, and S. Amarasinghe. A lightweight streaming layer for multicore execution. SIGARCH Comput. Archit. News, 36(2):18--27, May 20.

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
POPL '15: Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages
January 2015
716 pages
ISBN:9781450333009
DOI:10.1145/2676726
  • cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 50, Issue 1
    POPL '15
    January 2015
    682 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2775051
    • Editor:
    • Andy Gill
    Issue’s Table of Contents
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

In-Cooperation

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 14 January 2015

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. data-parallel
  2. satisfiability modulo theories
  3. state space exploration
  4. symbolic automaton
  5. symbolic trandsducer

Qualifiers

  • Research-article

Conference

POPL '15
Sponsor:

Acceptance Rates

POPL '15 Paper Acceptance Rate 52 of 227 submissions, 23%;
Overall Acceptance Rate 860 of 4,328 submissions, 20%

Upcoming Conference

POPL '26

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)5
  • Downloads (Last 6 weeks)2
Reflects downloads up to 19 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2018)CSEProceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture10.1109/MICRO.2018.00012(29-41)Online publication date: 20-Oct-2018
  • (2017)Parallel Automata ProcessorACM SIGARCH Computer Architecture News10.1145/3140659.308020745:2(600-612)Online publication date: 24-Jun-2017
  • (2017)Fusing effectful comprehensionsACM SIGPLAN Notices10.1145/3140587.306236252:6(17-32)Online publication date: 14-Jun-2017
  • (2017)Parallel Automata ProcessorProceedings of the 44th Annual International Symposium on Computer Architecture10.1145/3079856.3080207(600-612)Online publication date: 24-Jun-2017
  • (2017)Fusing effectful comprehensionsProceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation10.1145/3062341.3062362(17-32)Online publication date: 14-Jun-2017
  • (2017)IntroductionString Analysis for Software Verification and Security10.1007/978-3-319-68670-7_1(1-13)Online publication date: 13-Dec-2017
  • (2017)The Power of Symbolic Automata and TransducersComputer Aided Verification10.1007/978-3-319-63387-9_3(47-67)Online publication date: 13-Jul-2017
  • (2016)Optimal sanitization synthesis for web application vulnerability repairProceedings of the 25th International Symposium on Software Testing and Analysis10.1145/2931037.2931050(189-200)Online publication date: 18-Jul-2016
  • (2016)Kleenex: compiling nondeterministic transducers to deterministic streaming transducersACM SIGPLAN Notices10.1145/2914770.283764751:1(284-297)Online publication date: 11-Jan-2016
  • (2016)Kleenex: compiling nondeterministic transducers to deterministic streaming transducersProceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages10.1145/2837614.2837647(284-297)Online publication date: 11-Jan-2016
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media