skip to main content
10.1145/2987550.2987573acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

ReStream: Accelerating Backtesting and Stream Replay with Serial-Equivalent Parallel Processing

Published: 05 October 2016 Publication History

Abstract

Real-time predictive applications can demand continuous and agile development, with new models constantly being trained, tested, and then deployed. Training and testing are done by replaying stored event logs, running new models in the context of historical data in a form of backtesting or "what if?" analysis. To replay weeks or months of logs while developers wait, we need systems that can stream event logs through prediction logic many times faster than the real-time rate. A challenge with high-speed replay is preserving sequential semantics while harnessing parallel processing power. The crux of the problem lies with causal dependencies inherent in the sequential semantics of log replay.
We introduce an execution engine that produces serial-equivalent output while accelerating throughput with pipelining and distributed parallelism. This is made possible by optimizing for high throughput rather than the traditional stream processing goal of low latency, and by aggressive sharing of versioned state, a technique we term Multi-Versioned Parallel Streaming (MVPS). In experiments we see that this engine, which we call ReStream, performs as well as batch processing and more than an order of magnitude better than a single-threaded implementation.

References

[1]
D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: A new model and architecture for data stream management. The VLDB Journal, 12(2):120--139, Aug. 2003.
[2]
T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. Proceedings of the VLDB Endowment, 6(11):1033--1044, 2013.
[3]
T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, et al. The Dataflow Model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proceedings of the VLDB Endowment, 8(12):1792--1803, 2015.
[4]
Apache Flink - Scalable Batch and Stream Data Processing. http://flink.apache.org/. Accessed: 2016-08-22.
[5]
Apache Samza. https://samza.apache.org/. Accessed: 2016-08-22.
[6]
A. Arasu, B. Babcock, S. Babu, J. Cieslewicz, M. Datar, K. Ito, R. Motwani, U. Srivastava, and J. Widom. Stream: The Stanford data stream management system. Book chapter, 2004.
[7]
A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear Road: A stream data management benchmark. In Proceedings of the Thirtieth International Conference on Very Large Data Bases - Volume 30, pages 480--491. VLDB Endowment, 2004.
[8]
AWS --- Amazon EC2 --- Instance Types. http://aws.amazon.com/ec2/instance-types/. Accessed: 2016-08-22.
[9]
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proceedings of the Twenty-first ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS '02, pages 1--16. ACM, 2002.
[10]
B. Baesens, V. Van Vlasselaer, and W. Verbeke. Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection. John Wiley & Sons, 2015.
[11]
R. S. Barga, J. Goldstein, M. Ali, and M. Hong. Consistent streaming through time: A vision for event stream processing. arXiv preprint cs/0612115, 2006.
[12]
P. A. Bernstein and N. Goodman. Concurrency control in distributed database systems. ACM Computing Surveys (CSUR), 13(2):185--221, 1981.
[13]
B. Berstel. Extending the Rete algorithm for event management. In Temporal Representation and Reasoning, 2002. TIME 2002. Proceedings. Ninth International Symposium on, pages 49--51. IEEE, 2002.
[14]
R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 725--736. ACM, 2013.
[15]
B. Chandramouli, J. Goldstein, M. Barnett, R. DeLine, D. Fisher, J. C. Platt, J. F. Terwilliger, and J. Wernsing. Trill: A high-performance incremental query processor for diverse analytics. Proceedings of the VLDB Endowment, 8(4):401--412, 2014.
[16]
S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 668--668. ACM, 2003.
[17]
L. Chen, A. Mislove, and C. Wilson. An empirical analysis of algorithmic pricing on Amazon marketplace. In Proceedings of the 25th International Conference on World Wide Web, WWW '16, pages 1339--1349. International World Wide Web Conferences Steering Committee, 2016.
[18]
G. Cugola and A. Margara. Processing flows of information: From data stream to complex event processing. ACM Comput. Surv., 44(3):15:1--15:62, June 2012.
[19]
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. OSDI, 2004.
[20]
EsperTech - Esper - Event Series Analysis and Complex Event Processing for Java. http://www.espertech.com/products/esper.php. Accessed: 2016-08-22.
[21]
J. M. Faleiro and D. J. Abadi. Rethinking serializable multiversion concurrency control. Proceedings of the VLDB Endowment, 8(11):1190--1201, July 2015.
[22]
C. L. Forgy. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 19(1):17--37, 1982.
[23]
D. Gelernter. Generative communication in Linda. ACM Transactions on Programming Languages and Systems (TOPLAS), 7(1):80--112, 1985.
[24]
G. Graefe. Encapsulation of parallelism in the Volcano query processing system. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, SIGMOD '90, pages 102--111. ACM, 1990.
[25]
J. Hall, C. Kendrick, and C. Nosko. The effects of Uber's surge pricing: A case study. The University of Chicago Booth School of Business, 2015.
[26]
F. Hayes-Roth. Rule-based systems. Commun. ACM, 28(9): 921--932, Sept. 1985.
[27]
IBM Operational Decision Manager. https://www.ibm.com/software/products/en/odm. Accessed: 2016-08-22.
[28]
IBM Streams. https://www.ibm.com/software/products/en/ibm-streams. Accessed: 2016-08-22.
[29]
JBoss Drools - Business Rules Management System. http://www.drools.org/. Accessed: 2016-08-22.
[30]
A. Lakshman and P. Malik. Cassandra: A decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44(2):35--40, Apr. 2010.
[31]
W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. Proceedings of the VLDB Endowment, 5(12):1814--1825, 2012.
[32]
T. J. LeBlanc and J. M. Mellor-Crummey. Debugging parallel programs with instant replay. IEEE Transactions on Computers, 100(4):471--482, 1987.
[33]
M. Maas, T. Harris, K. Asanović, and J. Kubiatowicz. Trash day: Coordinating garbage collection in distributed systems. In Proceedings of the 15th USENIX Conference on Hot Topics in Operating Systems, HOTOS'15. USENIX Association, 2015.
[34]
H. B. McMahan, G. Holt, D. Sculley, M. Young, D. Ebner, J. Grady, L. Nie, T. Phillips, E. Davydov, D. Golovin, S. Chikkerur, D. Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, and J. Kubica. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '13, pages 1222--1230. ACM, 2013.
[35]
F. McSherry, M. Isard, and D. G. Murray. Scalability! but at what COST? In 15th Workshop on Hot Topics in Operating Systems (HotOS XV), 2015.
[36]
J. Meehan, N. Tatbul, S. Zdonik, C. Aslantas, U. Çetintemel, J. Du, T. Kraska, S. Madden, D. Maier, A. Pavlo, et al. S-Store: Streaming meets transaction processing. Proceedings of the VLDB Endowment, 8(13):2134--2145, 2015.
[37]
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3 (1-2):330--339, Sept. 2010.
[38]
A. Milenkoski, M. Vieira, S. Kounev, A. Avritzer, and B. D. Payne. Evaluating computer intrusion detection systems: A survey of common practices. ACM Comput. Surv., 48(1):12:1--12:41, Sept. 2015.
[39]
A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement, IMC '07, pages 29--42. ACM, 2007.
[40]
D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: A timely dataflow system. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 439--455. ACM, 2013.
[41]
L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In Data Mining Workshops (ICDMW), 2010 IEEE International Conference on, pages 170--177. IEEE, 2010.
[42]
Oracle Complex Event Processing. http://www.oracle.com/technetwork/middleware/complex-event-processing/overview/index.html. Accessed: 2016-08-22.
[43]
J. Schleier-Smith. An architecture for Agile machine learning in real-time applications. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '15, pages 2059--2068. ACM, 2015.
[44]
M. A. Shah, J. M. Hellerstein, S. Chandrasekaran, and M. J. Franklin. Flux: An adaptive partitioning operator for continuous query systems. In Data Engineering, 2003. Proceedings. 19th International Conference on, pages 25--36. IEEE, 2003.
[45]
M. Stonebraker, U. Çetintemel, and S. Zdonik. The 8 requirements of real-time stream processing. SIGMOD Rec., 34(4):42--47, Dec. 2005.
[46]
StreamBase Complex Event Processing - TIBCO. http://www.tibco.com/products/event-processing/complex-event-processing/streambase-complex-event-processing. Accessed: 2016-08-22.
[47]
N. Tatbul, U. Çetintemel, S. Zdonik, M. Cherniack, and M. Stonebraker. Load shedding in a data stream manager. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29, VLDB '03, pages 309--320. VLDB Endowment, 2003.
[48]
A. Thomson, T. Diamond, S.-C. Weng, K. Ren, P. Shao, and D. J. Abadi. Calvin: Fast distributed transactions for partitioned database systems. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, SIGMOD '12, pages 1--12. ACM, 2012.
[49]
A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm@Twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 147--156. ACM, 2014.
[50]
P. Treleaven, M. Galas, and V. Lalchand. Algorithmic trading review. Commun. ACM, 56(11):76--85, Nov. 2013.
[51]
E. Wu, Y. Diao, and S. Rizvi. High-performance complex event processing over streams. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data, pages 407--418. ACM, 2006.
[52]
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, volume 10, 2010.
[53]
M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 423--438. ACM, 2013.

Cited By

View all
  • (2019)An Exploratory Study of How Specialists Deal with Testing in Data Stream Processing Applications2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)10.1109/ESEM.2019.8870186(1-6)Online publication date: Sep-2019
  • (2018)Penguin: Efficient Query-Based Framework for Replaying Large Scale Historical DataIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2018.282975929:10(2333-2345)Online publication date: 1-Oct-2018

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SoCC '16: Proceedings of the Seventh ACM Symposium on Cloud Computing
October 2016
534 pages
ISBN:9781450345255
DOI:10.1145/2987550
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Stream replay
  2. backtesting
  3. distributed stream processing

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

SoCC '16
Sponsor:
SoCC '16: ACM Symposium on Cloud Computing
October 5 - 7, 2016
CA, Santa Clara, USA

Acceptance Rates

SoCC '16 Paper Acceptance Rate 38 of 151 submissions, 25%;
Overall Acceptance Rate 169 of 722 submissions, 23%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)11
  • Downloads (Last 6 weeks)1
Reflects downloads up to 15 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2019)An Exploratory Study of How Specialists Deal with Testing in Data Stream Processing Applications2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM)10.1109/ESEM.2019.8870186(1-6)Online publication date: Sep-2019
  • (2018)Penguin: Efficient Query-Based Framework for Replaying Large Scale Historical DataIEEE Transactions on Parallel and Distributed Systems10.1109/TPDS.2018.282975929:10(2333-2345)Online publication date: 1-Oct-2018

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media