Abstract
Resource Managers like YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault tolerance, task scheduling and coordination) and reimplement common mechanisms (e.g., caching, bulk-data transfers). This article presents REEF, a development framework that provides a control plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource reuse for data caching and state management abstractions that greatly ease the development of elastic data processing pipelines on cloud platforms that support a Resource Manager service. We illustrate the power of REEF by showing applications built atop: a distributed shell application, a machine-learning framework, a distributed in-memory caching system, and a port of the CORFU system. REEF is currently an Apache top-level project that has attracted contributors from several institutions and it is being used to develop several commercial offerings such as the Azure Stream Analytics service.
- Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. 2011. A reliable effective terascale linear learning system. CoRR abs/1110.4198 (2011).Google Scholar
- A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. 2012. Scalable inference in latent variable models. In ACM International Conference on Web Search and Data Mining (WSDM’12). Google ScholarDigital Library
- Peter Alvaro, Neil Conway, Joe Hellerstein, and William R. Marczak. 2011. Consistency analysis in bloom: A CALM and collected approach. In Conference on Innovative Data Systems Research (CIDR’11).Google Scholar
- Colin McCabe and Andrew Wang. 2013. Centralized cache management in HDFS. https://issues.apache.org/jira/browse/HDFS-4949.Google Scholar
- The Kubernetes Authors. 2015. Kubernetes. Retrieved from https://kubernetes.io/.Google Scholar
- Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Transactions on Computer Systems (TOCS) 31, 4 (2013), 10.Google ScholarDigital Library
- Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. 2010. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In ACM Symposium on Cloud Computing (SoCC’10).Google ScholarDigital Library
- Alex Beutel, Markus Weimer, Vijay Narayanan, Yordan Zaykov, and Tom Minka. 2014. Elastic distributed bayesian collaborative filtering. In NIPS Workshop on Distributed Machine Learning and Matrix Computations.Google Scholar
- Vinayak Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Declarative systems for large-scale machine learning. IEEE Technical Committee on Data Engineering (TCDE) 35, 2 (2012), 24--32.Google Scholar
- Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In International Conference on Data Engineering (ICDE’11). Google ScholarDigital Library
- Olivier Bousquet and Léon Bottou. 2007. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS’07).Google Scholar
- Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 571--582.Google Scholar
- Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. 2006. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems (NIPS’06).Google Scholar
- Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231.Google Scholar
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
- Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: A runtime for iterative mapreduce. In ACM International Symposium on High Performance Distributed Computing (HPDC’10). Google ScholarDigital Library
- Google. 2015. Guice. Retrieved from https://github.com/google/guice.Google Scholar
- William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. 1998. MPI - The Complete Reference: Volume 2, The MPI-2 Extensions. MIT Press.Google Scholar
- Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A platform for fine-grained resource sharing in the data center. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’11).Google Scholar
- Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM European Conference on Computer Systems (EuroSys’07).Google ScholarDigital Library
- Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. 2010. An analysis of traces from a production mapreduce cluster. In IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid’10). Google ScholarDigital Library
- Michael Kearns. 1998. Efficient noise-tolerant learning from statistical queries. Journal of the ACM 45, 6 (1998), 983--1006. Google ScholarDigital Library
- Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The click modular router. ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), 263--297. Google ScholarDigital Library
- J. Kreps, N. Narkhede, and J. Rao. 2011. Kafka: A distributed messaging system for log processing. In International Workshop on Networking Meets Databases (NetDB’11).Google Scholar
- Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer, and Vijay Narayanan. 2013. Distributed and scalable PCA in the cloud. In BigLearn NIPS Workshop.Google Scholar
- Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In ACM Symposium on Cloud Computing (SoCC’14).Google ScholarDigital Library
- Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). Google ScholarDigital Library
- Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI’10).Google Scholar
- Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’10). Google ScholarDigital Library
- Nathan Marz. 2015. Storm: Distributed and Fault-Tolerant Realtime Computation. http://storm.apache.org.Google Scholar
- Erik Meijer. 2012. Your mouse is a database. Communications of the ACM 55, 5 (2012), 66--73. Google ScholarDigital Library
- Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A timely dataflow system. In ACM Symposium on Operating Systems Principles (SOSP’13).Google ScholarDigital Library
- Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan, Tyson Condie, Sundararajan Sellamanickam, and S. Sathiya Keerthi. 2013. Towards resource-elastic machine learning. In BigLearn NIPS Workshop.Google Scholar
- Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed stream computing platform. In IEEE International Conference on Data Mining Workshops (ICDMW’10).Google ScholarDigital Library
- Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’08). Google ScholarDigital Library
- Ariel Rabkin. 2012. Using Program Analysis to Reduce Misconfiguration in Open Source Systems Software. Ph.D. Dissertation, UC Berkeley (2012).Google Scholar
- David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning Internal Representations by Error Propagation. Technical Report. DTIC Document. Google ScholarCross Ref
- Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and Carlo Curino. 2015. Apache tez: A unifying framework for modeling and building data processing applications. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15). Google ScholarDigital Library
- Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: Flexible, scalable schedulers for large compute clusters. In ACM European Conference on Computer Systems (EuroSys’13).Google ScholarDigital Library
- Marc Shapiro and Nuno M. Preguiça. 2007. Designing a commutative replicated data type. CoRR abs/0710.1784.Google Scholar
- Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big data analytics with datalog queries on spark. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16). ACM, New York,1135--1149. DOI:http://dx.doi.org/10.1145/2882903.2915229 Google ScholarDigital Library
- Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proceedings of the VLDB Endowment 3, 1--2 (2010), 703--710.Google ScholarDigital Library
- Michael Stonebraker and Ugur Cetintemel. 2005. One size fits all: An idea whose time has come and gone. In International Conference on Data Engineering (ICDE’05).Google ScholarDigital Library
- The Apache Software Foundation. 2017. Apache Accumulo. Retrieved from http://accumulo.apache.org/.Google Scholar
- The Apache Software Foundation. 2017. Apache Giraph. Retrieved from http://giraph.apache.org/.Google Scholar
- The Apache Software Foundation. 2017. Apache Hadoop. Retrieved from http://hadoop.apache.org.Google Scholar
- The Apache Software Foundation. 2017. Apache Mahout. Retrieved from http://mahout.apache.org.Google Scholar
- The Apache Software Foundation. 2017. Apache Slider. Retrieved from http://slider.incubator.apache.org/.Google Scholar
- The Apache Software Foundation. 2017. Apache Twill. Retrieved from http://twill.apache.org/.Google Scholar
- The Netty Project. 2015. Netty. Retrieved from http://netty.io.Google Scholar
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive -- A warehousing solution over a map-reduce framework. In Proceedings of the VLDB Endowment (PVLDB’09).Google Scholar
- Leslie G. Valiant. 1990. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990), 103--111. Google ScholarDigital Library
- Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache hadoop YARN: Yet another resource negotiator. In ACM Symposium on Cloud Computing (SoCC’13).Google ScholarDigital Library
- Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, Brandon Myers, Shravan Narayanamuthy, Raghu Ramakrishnan, Sriram Rao, Russell Sear, Beysim Sezgin, and Julia Wang. 2015. REEF: Retainable evaluator execution framework. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15). Google ScholarDigital Library
- Markus Weimer, Sriram Rao, and Martin Zinkevich. 2010. A convenient framework for efficient parallel multipass algorithms. In NIPS Workshop on Learning on Cores, Clusters and Clouds.Google Scholar
- Matt Welsh. 2013. What I wish systems researchers would work on. Retrieved from http://matt-welsh.blogspot.com/2013/05/what-i-wish-systems-researchers-would.html.Google Scholar
- Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA: An architecture for well-conditioned, scalable internet services. SIGOPS Operating Systems Review 35 (2001), 230--243. Google ScholarDigital Library
- Jerry Ye, Jyh-Herng Chow, Jiang Chen, and Zhaohui Zheng. 2009. Stochastic gradient boosted distributed decision trees. In ACM Conference on Information and Knowledge Management (CIKM’09). Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’12).Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’10).Google Scholar
- Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel databases meet mapreduce. VLDB Journal 21, 5 (2012), 611--636. Google ScholarDigital Library
Index Terms
- Apache REEF: Retainable Evaluator Execution Framework
Recommendations
REEF: Retainable Evaluator Execution Framework
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataResource Managers like Apache YARN have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. This flexibility comes at a ...
Comments