ABSTRACT
Many online data sets evolve over time as new entries are slowly added and existing entries are deleted or modified. Taking advantage of this, systems for incremental bulk data processing, such as Google's Percolator, can achieve efficient updates. To achieve this efficiency, however, these systems lose compatibility with the simple programming models offered by non-incremental systems, e.g., MapReduce, and more importantly, requires the programmer to implement application-specific dynamic algorithms, ultimately increasing algorithm and code complexity.
In this paper, we describe the architecture, implementation, and evaluation of Incoop, a generic MapReduce framework for incremental computations. Incoop detects changes to the input and automatically updates the output by employing an efficient, fine-grained result reuse mechanism. To achieve efficiency without sacrificing transparency, we adopt recent advances in the area of programming languages to identify the shortcomings of task-level memoization approaches, and to address these shortcomings by using several novel techniques: a storage system, a contraction phase for Reduce tasks, and an affinity-based scheduling algorithm. We have implemented Incoop by extending the Hadoop framework, and evaluated it by considering several applications and case studies. Our results show significant performance improvements without changing a single line of application code.
- U. A. Acar, G. E. Blelloch, M. Blume, R. Harper, and K. Tangwongsan. An experimental analysis of self-adjusting computation. ACM Trans. Programming Languages and Systems, 32(1):1--53, 2009. Google ScholarDigital Library
- U. A. Acar, G. E. Blelloch, and R. Harper. Adaptive functional programming. ACM Trans. Programming Languages and Systems, 28(6):990--1034, 2006. Google ScholarDigital Library
- P. Bhatotia, A. Wieder, I. E. Akkus, R. Rodrigues, and U. A. Acar. Large-scale incremental data processing with change propagation. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud'11). Google ScholarDigital Library
- Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. In 36th International Conference on Very Large Data Bases, Singapore, September 14--16, 2010. Google ScholarDigital Library
- S. Ceri and J. Widom. Deriving production rules for incremental view maintenance. In Proceedings of the 17th International Conference on Very Large Data Bases, pages 577--589, 1991. Google ScholarDigital Library
- Y.-J. Chiang and R. Tamassia. Dynamic algorithms in computational geometry. Proceedings of the IEEE, 80(9):1412--1434, 1992.Google ScholarCross Ref
- T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. Mapreduce online. In Proc. 7th Symposium on Networked systems design and implementation (NSDI'10). Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In Proc. 6th Symposium on Operating Systems Design and Implementation (OSDI'04). Google ScholarDigital Library
- C. Demetrescu, I. Finocchi, and G. Italiano. Handbook on Data Structures and Applications. CRC, 2005.Google Scholar
- P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic management of data and computation in data centers. In Proc. 9th Symp. Operating Systems Design and Implementation (OSDI'10). Google ScholarDigital Library
- M. Hammer, U. A. Acar, M. Rajagopalan, and A. Ghuloum. A proposal for parallel self-adjusting computation. In DAMP '07: Proceedings of the first workshop on Declarative Aspects of Multicore Programming, 2007. Google ScholarDigital Library
- M. A. Hammer, U. A. Acar, and Y. Chen. CEAL: A C-based language for self-adjusting computation. In Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implimentation, June 2009. Google ScholarDigital Library
- B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Cornet: batched stream processing for data intensive distributed computing. In Proc. 1st Symposium on Cloud computing (SoCC'10). Google ScholarDigital Library
- D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In 1st Symp. on Cloud computing(SoCC'10). Google ScholarDigital Library
- D. Logothetis, C. Trezzo, K. C. Webb, and K. Yocum. In-situ mapreduce for log processing. In Proceedings of the 2011 USENIX conference on USENIX annual technical conference, USENIXATC'11, 2011. Google ScholarDigital Library
- A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proc. 18th Symp. on Operating systems principles (SOSP'01). Google ScholarDigital Library
- C. Olston and et.al. Nova: continuous pig/hadoop workflows. In Proceedings of the 2011 international conference on Management of data, SIGMOD, 2011. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1099--1110, 2008. Google ScholarDigital Library
- D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Proc. 9th Symposium on Operating Systems Design and Implementation (OSDI'10). Google ScholarDigital Library
- L. Popa, M. Budiu, Y. Yu, and M. Isard. DryadInc: Reusing work in large-scale computations. In Worksh. on Hot Topics in Cloud Computing (HotCloud'09). Google ScholarDigital Library
- G. Ramalingam and T. Reps. A Categorized Bibliography on Incremental Computation. In Proc. 20th Symp. Princ. of Progr. Languages (POPL'93). Google ScholarDigital Library
- M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving mapreduce performance in heterogeneous environments. In Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI'08, 2008. Google ScholarDigital Library
Index Terms
- Incoop: MapReduce for incremental computations
Recommendations
IncApprox: A Data Analytics System for Incremental Approximate Computing
WWW '16: Proceedings of the 25th International Conference on World Wide WebIncremental and approximate computations are increasingly being adopted for data analytics to achieve low-latency execution and efficient utilization of computing resources. Incremental computation updates the output incrementally instead of re-...
Incremental computation with names
OOPSLA '15Over the past thirty years, there has been significant progress in developing general-purpose, language-based approaches to incremental computation, which aims to efficiently update the result of a computation when an input is changed. A key design ...
Incremental computation with names
OOPSLA 2015: Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and ApplicationsOver the past thirty years, there has been significant progress in developing general-purpose, language-based approaches to incremental computation, which aims to efficiently update the result of a computation when an input is changed. A key design ...
Comments