research-article

ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Authors:

Johann Schleier-Smith,

Erik T. Krogen,

Joseph M. HellersteinAuthors Info & Claims

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing

Pages 334 - 347

https://doi.org/10.1145/2987550.2987573

Published: 05 October 2016 Publication History

Abstract

Real-time predictive applications can demand continuous and agile development, with new models constantly being trained, tested, and then deployed. Training and testing are done by replaying stored event logs, running new models in the context of historical data in a form of backtesting or "what if?" analysis. To replay weeks or months of logs while developers wait, we need systems that can stream event logs through prediction logic many times faster than the real-time rate. A challenge with high-speed replay is preserving sequential semantics while harnessing parallel processing power. The crux of the problem lies with causal dependencies inherent in the sequential semantics of log replay.

We introduce an execution engine that produces serial-equivalent output while accelerating throughput with pipelining and distributed parallelism. This is made possible by optimizing for high throughput rather than the traditional stream processing goal of low latency, and by aggressive sharing of versioned state, a technique we term Multi-Versioned Parallel Streaming (MVPS). In experiments we see that this engine, which we call ReStream, performs as well as batch processing and more than an order of magnitude better than a single-threaded implementation.

References

[1]

D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120--139, Aug. 2003.

Digital Library

[2]

T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 6(11):1033--1044, 2013.

Digital Library

[3]

T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, et al. The Dataflow Model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8(12):1792--1803, 2015.

Digital Library

[4]

Apache Flink - Scalable Batch and Stream Data Processing. http://flink.apache.org/. Accessed: 2016-08-22.

[5]

Apache Samza. https://samza.apache.org/. Accessed: 2016-08-22.

[6]

A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. Stream: The Stanford data stream management system. Book chapter, 2004.

[7]

A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear Road: A stream data management benchmark. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 480--491. VLDB Endowment, 2004.

Digital Library

[8]

AWS --- Amazon EC2 --- Instance Types. http://aws.amazon.com/ec2/instance-types/. Accessed: 2016-08-22.

[9]

B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, pages 1--16. ACM, 2002.

Digital Library

[10]

B. Baesens, V. Van Vlasselaer, and W. Verbeke. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. John Wiley & Sons, 2015.

Digital Library

[11]

R. S. Barga, J. Goldstein, M. Ali, and M. Hong. Consistent streaming through time: A vision for event stream processing. arXiv preprint cs/0612115, 2006.

[12]

P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys (CSUR), 13(2):185--221, 1981.

Digital Library

[13]

B. Berstel. Extending the Rete algorithm for event management. In Temporal Representation and Reasoning, 2002. TIME 2002. Proceedings. Ninth International Symposium on, pages 49--51. IEEE, 2002.

Digital Library

[14]

R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 725--736. ACM, 2013.

Digital Library

[15]

B. Chandramouli, J. Goldstein, M. Barnett, R. DeLine, D. Fisher, J. C. Platt, J. F. Terwilliger, and J. Wernsing. Trill: A high-performance incremental query processor for diverse analytics. Proceedings of the VLDB Endowment, 8(4):401--412, 2014.

Digital Library

[16]

S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 668--668. ACM, 2003.

Digital Library

[17]

L. Chen, A. Mislove, and C. Wilson. An empirical analysis of algorithmic pricing on Amazon marketplace. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 1339--1349. International World Wide Web Conferences Steering Committee, 2016.

Digital Library

[18]

G. Cugola and A. Margara. Processing flows of information: From data stream to complex event processing. ACM Comput. Surv., 44(3):15:1--15:62, June 2012.

Digital Library

[19]

J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI, 2004.

Digital Library

[20]

EsperTech - Esper - Event Series Analysis and Complex Event Processing for Java. http://www.espertech.com/products/esper.php. Accessed: 2016-08-22.

[21]

J. M. Faleiro and D. J. Abadi. Rethinking serializable multiversion concurrency control. Proceedings of the VLDB Endowment, 8(11):1190--1201, July 2015.

Digital Library

[22]

C. L. Forgy. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 19(1):17--37, 1982.

Digital Library

[23]

D. Gelernter. Generative communication in Linda. ACM Transactions on Programming Languages and Systems (TOPLAS), 7(1):80--112, 1985.

Digital Library

[24]

G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD '90, pages 102--111. ACM, 1990.

Digital Library

[25]

J. Hall, C. Kendrick, and C. Nosko. The effects of Uber's surge pricing: A case study. The University of Chicago Booth School of Business, 2015.

[26]

F. Hayes-Roth. Rule-based systems. Commun. ACM, 28(9): 921--932, Sept. 1985.

Digital Library

[27]

IBM Operational Decision Manager. https://www.ibm.com/software/products/en/odm. Accessed: 2016-08-22.

[28]

IBM Streams. https://www.ibm.com/software/products/en/ibm-streams. Accessed: 2016-08-22.

[29]

JBoss Drools - Business Rules Management System. http://www.drools.org/. Accessed: 2016-08-22.

[30]

A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35--40, Apr. 2010.

Digital Library

[31]

W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. Proceedings of the VLDB Endowment, 5(12):1814--1825, 2012.

Digital Library

[32]

T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with instant replay. IEEE Transactions on Computers, 100(4):471--482, 1987.

Digital Library

[33]

M. Maas, T. Harris, K. Asanović, and J. Kubiatowicz. Trash day: Coordinating garbage collection in distributed systems. In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems, HOTOS'15. USENIX Association, 2015.

Digital Library

[34]

H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 1222--1230. ACM, 2013.

Digital Library

[35]

F. McSherry, M. Isard, and D. G. Murray. Scalability! but at what COST? In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015.

Digital Library

[36]

J. Meehan, N. Tatbul, S. Zdonik, C. Aslantas, U. Çetintemel, J. Du, T. Kraska, S. Madden, D. Maier, A. Pavlo, et al. S-Store: Streaming meets transaction processing. Proceedings of the VLDB Endowment, 8(13):2134--2145, 2015.

Digital Library

[37]

S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3 (1-2):330--339, Sept. 2010.

Digital Library

[38]

A. Milenkoski, M. Vieira, S. Kounev, A. Avritzer, and B. D. Payne. Evaluating computer intrusion detection systems: A survey of common practices. ACM Comput. Surv., 48(1):12:1--12:41, Sept. 2015.

Digital Library

[39]

A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC '07, pages 29--42. ACM, 2007.

Digital Library

[40]

D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439--455. ACM, 2013.

Digital Library

[41]

L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pages 170--177. IEEE, 2010.

Digital Library

[42]

Oracle Complex Event Processing. http://www.oracle.com/technetwork/middleware/complex-event-processing/overview/index.html. Accessed: 2016-08-22.

[43]

J. Schleier-Smith. An architecture for Agile machine learning in real-time applications. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 2059--2068. ACM, 2015.

Digital Library

[44]

M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, and M. J. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In Data Engineering, 2003. Proceedings. 19th International Conference on, pages 25--36. IEEE, 2003.

[45]

M. Stonebraker, U. Çetintemel, and S. Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec., 34(4):42--47, Dec. 2005.

Digital Library

[46]

StreamBase Complex Event Processing - TIBCO. http://www.tibco.com/products/event-processing/complex-event-processing/streambase-complex-event-processing. Accessed: 2016-08-22.

[47]

N. Tatbul, U. Çetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB '03, pages 309--320. VLDB Endowment, 2003.

Digital Library

[48]

A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 1--12. ACM, 2012.

Digital Library

[49]

A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@Twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147--156. ACM, 2014.

Digital Library

[50]

P. Treleaven, M. Galas, and V. Lalchand. Algorithmic trading review. Commun. ACM, 56(11):76--85, Nov. 2013.

Digital Library

[51]

E. Wu, Y. Diao, and S. Rizvi. High-performance complex event processing over streams. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 407--418. ACM, 2006.

Digital Library

[52]

M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, volume 10, 2010.

Digital Library

[53]

M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 423--438. ACM, 2013.

Digital Library

Cited By

Vianna AFerreira WGama K(2019)An Exploratory Study of How Specialists Deal with Testing in Data Stream Processing Applications2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)10.1109/ESEM.2019.8870186(1-6)Online publication date: Sep-2019
https://doi.org/10.1109/ESEM.2019.8870186
Gu RZhou YWang ZYuan CHuang Y(2018)Penguin: Efficient Query-Based Framework for Replaying Large Scale Historical DataIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2018.282975929:10(2333-2345)Online publication date: 1-Oct-2018
https://doi.org/10.1109/TPDS.2018.2829759

Index Terms

ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing
1. Information systems

Recommendations

Auto-parallelizing stateful distributed streaming applications
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques

Streaming applications transform possibly infinite streams of data and often have both high throughput and low latency requirements. They are comprised of operator graphs that produce and consume data tuples. The streaming programming model naturally ...
Retrospective on Aurora

This experience paper summarizes the key lessons we learned throughout the design and implementation of the Aurora stream-processing engine. For the past 2 years, we have built five stream-based applications using Aurora. We first describe in detail ...
Priority-Based Resource Scheduling in Distributed Stream Processing Systems for Big Data Applications
UCC '14: Proceedings of the 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing

Distributed Stream Processing Systems (DSPSs) are attracting increasing industrial and academic interest as flexible tools to implement scalable and cost-effective on-line analytics applications over Big Data streams. Often hosted in private/public ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing

October 2016

534 pages

ISBN:9781450345255

DOI:10.1145/2987550

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

SoCC '16

Sponsor:

SoCC '16: ACM Symposium on Cloud Computing

October 5 - 7, 2016

CA, Santa Clara, USA

Acceptance Rates

SoCC '16 Paper Acceptance Rate 38 of 151 submissions, 25%;

Overall Acceptance Rate 169 of 722 submissions, 23%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
382
Total Downloads

Downloads (Last 12 months)11
Downloads (Last 6 weeks)1

Reflects downloads up to 15 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Vianna AFerreira WGama K(2019)An Exploratory Study of How Specialists Deal with Testing in Data Stream Processing Applications2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)10.1109/ESEM.2019.8870186(1-6)Online publication date: Sep-2019
https://doi.org/10.1109/ESEM.2019.8870186
Gu RZhou YWang ZYuan CHuang Y(2018)Penguin: Efficient Query-Based Framework for Replaying Large Scale Historical DataIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2018.282975929:10(2333-2345)Online publication date: 1-Oct-2018
https://doi.org/10.1109/TPDS.2018.2829759

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten