Abstract
We describe the timely dataflow model for distributed computation and its implementation in the Naiad system. The model supports stateful iterative and incremental computations. It enables both low-latency stream processing and high-throughput batch processing, using a new approach to coordination that combines asynchronous and fine-grained synchronous execution. We describe two of the programming frameworks built on Naiad: GraphLINQ for parallel graph processing, and differential dataflow for nested iterative and incremental computations. We show that a general-purpose system can achieve performance that matches, and sometimes exceeds, that of specialized systems.
- Abadi, M., Isard, M. Timely dataflow: A model. In Proc. FORTE (2015), 131--145.Google Scholar
- Abadi, M., Isard, M. Timely rollback: Specification and verification. In Proc. NASA Formal Methods (April 2015), 19--34.Google ScholarCross Ref
- Akidau, T., Balikov, A., Bekiroğlu, K., Chernyak, S., Haberman, J., Lax, R., McVeety, S., Mills, D., Nordstrom, P., Whittle, S. MillWheel: Fault-tolerant stream processing at internet scale. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1033--1044. Google ScholarDigital Library
- Chandramouli, B., Goldstein, J., Maier, D. On-the-fly progress detection in iterative stream queries. Proc. VLDB Endow. 2, 1 (Aug. 2009), 241--252. Google ScholarDigital Library
- Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A., Gruber, R.E. Bigtable: A distributed storage system for structured data. In Proc. OSDI (Nov. 2006), 205--218. Google ScholarDigital Library
- Dean, J., Ghemawat, S. Mapreduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (Jan. 2008), 107--113. Google ScholarDigital Library
- DeWitt, D., Gray, J. Parallel database systems: The future of high performance database systems. Commun. ACM 35, 6 (June 1992), 85--98. Google ScholarDigital Library
- Ewen, S., Tzoumas, K., Kaufmann, M., Markl, V. Spinning fast iterative data flows. Proc. VLDB Endow. 5, 11 (July 2012), 1268--1279. Google ScholarDigital Library
- Gog, I., Giceva, J., Schwarzkopf, M., Vaswani, K., Vytiniotis, D., Ramalingam, G., Costa, M., Murray, D.G., Hand, S., Isard, M. Broom: Sweeping out garbage collection from big data systems. In Proc. HotOS (May 2015). Google ScholarDigital Library
- Gog, I., Schwarzkopf, M., Crooks, N., Grosvenor, M.P., Clement, A., Hand, S. Musketeer: All for one, one for all in data processing systems. In Proc. EuroSys (Apr. 2015). Google ScholarDigital Library
- Gonzalez, J.E., Low, Y., Gu, H., Bickson, D., Guestrin, C. PowerGraph: Distributed graph-parallel computation on natural graphs. In Proc. OSDI (Oct. 2012), 17--30. Google ScholarDigital Library
- Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J., Stoica, I. GraphX: Graph processing in a distributed dataflow framework. In Proc. OSDI (Oct. 2014), 599--613. Google ScholarDigital Library
- Isard, M., Budiu, M., Yu, Y., Birrell, A., Fetterly, D. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. EuroSys (Mar. 2007), 59--72. Google ScholarDigital Library
- Lee, E., Messerschmitt, D.G. Synchronous data flow. Proc. IEEE 75, 9 (1987), 1235--1245.Google ScholarCross Ref
- Li, M., Andersen, D.G., Park, J.W., Smola, A.J., Ahmed, A., Josifovski, V., Long, J., Shekita, E.J., Su, B.-Y. Scaling distributed machine learning with the parameter server. In Proc. OSDI (Oct. 2014), 583--598. Google ScholarDigital Library
- McSherry, F., Isard, M., Murray, D.G. Scalability! But at what COST? In Proc. HotOS (May 2015). Google ScholarDigital Library
- McSherry, F., Murray, D.G., Isaacs, R., Isard, M. Differential dataflow. In Proc. CIDR (Jan. 2013).Google Scholar
- Melnik, S., Gubarev, A., Long, J.J., Romer, G., Shivakumar, S., Tolton, M., Vassilakis, T. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow. Proc. VLDB Endow. 3, 1--2 (Sep. 2010), 330--339. Google ScholarDigital Library
- Murray, D.G., McSherry, F., Isaacs, R., Isard, M., Barham, P., Abadi, M. Naiad: A timely dataflow system. In Proc. SOSP (Nov. 2013), 439--455. Google ScholarDigital Library
- Murray, D.G., Schwarzkopf, M., Smowton, C., Smith, S., Madhavapeddy, A., Hand, S. CIEL: A universal execution engine for distributed data-flow computing. In Proc. NSDI (Mar. 2011), 113--126. Google ScholarDigital Library
- Peng, D., Dabek, F. Large-scale incremental processing using distributed transactions and notifications. In Proc. OSDI (Oct. 2010), 251--264. Google ScholarDigital Library
- Sousa, M., Dillig, I., Vytiniotis, D., Dillig, T., Gkantsidis, C. Consolidation of queries with user-defined functions. In Proc. PLDI (June 2014), 554--564. Google ScholarDigital Library
- Tel, G., Mattern, F. The derivation of distributed termination detection algorithms from garbage collection schemes. ACM Trans. Program. Lang. Syst. 15, 1 (Jan. 1993), 1--35. Google ScholarDigital Library
- Tucker, P.A., Maier, D., Sheard, T., Fegaras, L. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowledge Data Eng. 15, 3 (2003), 555--568. Google ScholarDigital Library
- Yu, Y., Gunda, P.K., Isard, M. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Proc. SOSP (Oct. 2009), 247--260. Google ScholarDigital Library
- Yu, Y., Isard, M., Fetterly, D., Budiu, M., Erlingsson, Ú., Gunda, P.K., Currey, J. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. OSDI (Dec. 2008), 1--14. Google ScholarDigital Library
- Zaharia, M., Chowdhury, M., Das, T., Dave, A., Ma, J., McCauley, M., Franklin, M., Shenker, S., Stoica, I. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proc. NSDI (Apr. 2012). Google ScholarDigital Library
Index Terms
- Incremental, iterative data processing with timely dataflow
Recommendations
An Incremental Version of Iterative Data Flow Analysis
A technique is presented for incrementally updating solutions to both union and intersection data-flow problems in response to program edits and transformations. For generality, the technique is based on the iterative approach to computing data-flow ...
Large-scale incremental data processing with change propagation
HotCloud'11: Proceedings of the 3rd USENIX conference on Hot topics in cloud computingIncremental processing of large-scale data is an increasingly important problem, given that many processing jobs run repeatedly with similar inputs, and that the de facto standard programmingmodel (MapReduce) was not designed to efficiently process ...
Incremental stream processing using computational conflict-free replicated data types
CloudDP '13: Proceedings of the 3rd International Workshop on Cloud Data and PlatformsInformation has become a key commodity for most service providers. Analyzing streams of data efficiently, in real time, has become increasingly more important for supporting new products and applications.
This paper outlines a novel abstraction for ...
Comments