skip to main content
research-article
Open Access

Apache REEF: Retainable Evaluator Execution Framework

Authors Info & Claims
Published:10 October 2017Publication History
Skip Abstract Section

Abstract

Resource Managers like YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault tolerance, task scheduling and coordination) and reimplement common mechanisms (e.g., caching, bulk-data transfers). This article presents REEF, a development framework that provides a control plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource reuse for data caching and state management abstractions that greatly ease the development of elastic data processing pipelines on cloud platforms that support a Resource Manager service. We illustrate the power of REEF by showing applications built atop: a distributed shell application, a machine-learning framework, a distributed in-memory caching system, and a port of the CORFU system. REEF is currently an Apache top-level project that has attracted contributors from several institutions and it is being used to develop several commercial offerings such as the Azure Stream Analytics service.

References

  1. Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. 2011. A reliable effective terascale linear learning system. CoRR abs/1110.4198 (2011).Google ScholarGoogle Scholar
  2. A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. 2012. Scalable inference in latent variable models. In ACM International Conference on Web Search and Data Mining (WSDM’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Peter Alvaro, Neil Conway, Joe Hellerstein, and William R. Marczak. 2011. Consistency analysis in bloom: A CALM and collected approach. In Conference on Innovative Data Systems Research (CIDR’11).Google ScholarGoogle Scholar
  4. Colin McCabe and Andrew Wang. 2013. Centralized cache management in HDFS. https://issues.apache.org/jira/browse/HDFS-4949.Google ScholarGoogle Scholar
  5. The Kubernetes Authors. 2015. Kubernetes. Retrieved from https://kubernetes.io/.Google ScholarGoogle Scholar
  6. Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Transactions on Computer Systems (TOCS) 31, 4 (2013), 10.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. 2010. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In ACM Symposium on Cloud Computing (SoCC’10).Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Alex Beutel, Markus Weimer, Vijay Narayanan, Yordan Zaykov, and Tom Minka. 2014. Elastic distributed bayesian collaborative filtering. In NIPS Workshop on Distributed Machine Learning and Matrix Computations.Google ScholarGoogle Scholar
  9. Vinayak Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Declarative systems for large-scale machine learning. IEEE Technical Committee on Data Engineering (TCDE) 35, 2 (2012), 24--32.Google ScholarGoogle Scholar
  10. Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In International Conference on Data Engineering (ICDE’11). Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Olivier Bousquet and Léon Bottou. 2007. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS’07).Google ScholarGoogle Scholar
  12. Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 571--582.Google ScholarGoogle Scholar
  13. Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. 2006. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems (NIPS’06).Google ScholarGoogle Scholar
  14. Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231.Google ScholarGoogle Scholar
  15. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: A runtime for iterative mapreduce. In ACM International Symposium on High Performance Distributed Computing (HPDC’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Google. 2015. Guice. Retrieved from https://github.com/google/guice.Google ScholarGoogle Scholar
  18. William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. 1998. MPI - The Complete Reference: Volume 2, The MPI-2 Extensions. MIT Press.Google ScholarGoogle Scholar
  19. Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A platform for fine-grained resource sharing in the data center. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’11).Google ScholarGoogle Scholar
  20. Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM European Conference on Computer Systems (EuroSys’07).Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. 2010. An analysis of traces from a production mapreduce cluster. In IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Michael Kearns. 1998. Efficient noise-tolerant learning from statistical queries. Journal of the ACM 45, 6 (1998), 983--1006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The click modular router. ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), 263--297. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Kreps, N. Narkhede, and J. Rao. 2011. Kafka: A distributed messaging system for log processing. In International Workshop on Networking Meets Databases (NetDB’11).Google ScholarGoogle Scholar
  25. Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer, and Vijay Narayanan. 2013. Distributed and scalable PCA in the cloud. In BigLearn NIPS Workshop.Google ScholarGoogle Scholar
  26. Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In ACM Symposium on Cloud Computing (SoCC’14).Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI’10).Google ScholarGoogle Scholar
  29. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’10). Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Nathan Marz. 2015. Storm: Distributed and Fault-Tolerant Realtime Computation. http://storm.apache.org.Google ScholarGoogle Scholar
  31. Erik Meijer. 2012. Your mouse is a database. Communications of the ACM 55, 5 (2012), 66--73. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A timely dataflow system. In ACM Symposium on Operating Systems Principles (SOSP’13).Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan, Tyson Condie, Sundararajan Sellamanickam, and S. Sathiya Keerthi. 2013. Towards resource-elastic machine learning. In BigLearn NIPS Workshop.Google ScholarGoogle Scholar
  34. Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed stream computing platform. In IEEE International Conference on Data Mining Workshops (ICDMW’10).Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Ariel Rabkin. 2012. Using Program Analysis to Reduce Misconfiguration in Open Source Systems Software. Ph.D. Dissertation, UC Berkeley (2012).Google ScholarGoogle Scholar
  37. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning Internal Representations by Error Propagation. Technical Report. DTIC Document. Google ScholarGoogle ScholarCross RefCross Ref
  38. Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and Carlo Curino. 2015. Apache tez: A unifying framework for modeling and building data processing applications. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: Flexible, scalable schedulers for large compute clusters. In ACM European Conference on Computer Systems (EuroSys’13).Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Marc Shapiro and Nuno M. Preguiça. 2007. Designing a commutative replicated data type. CoRR abs/0710.1784.Google ScholarGoogle Scholar
  41. Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big data analytics with datalog queries on spark. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16). ACM, New York,1135--1149. DOI:http://dx.doi.org/10.1145/2882903.2915229 Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proceedings of the VLDB Endowment 3, 1--2 (2010), 703--710.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Michael Stonebraker and Ugur Cetintemel. 2005. One size fits all: An idea whose time has come and gone. In International Conference on Data Engineering (ICDE’05).Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. The Apache Software Foundation. 2017. Apache Accumulo. Retrieved from http://accumulo.apache.org/.Google ScholarGoogle Scholar
  45. The Apache Software Foundation. 2017. Apache Giraph. Retrieved from http://giraph.apache.org/.Google ScholarGoogle Scholar
  46. The Apache Software Foundation. 2017. Apache Hadoop. Retrieved from http://hadoop.apache.org.Google ScholarGoogle Scholar
  47. The Apache Software Foundation. 2017. Apache Mahout. Retrieved from http://mahout.apache.org.Google ScholarGoogle Scholar
  48. The Apache Software Foundation. 2017. Apache Slider. Retrieved from http://slider.incubator.apache.org/.Google ScholarGoogle Scholar
  49. The Apache Software Foundation. 2017. Apache Twill. Retrieved from http://twill.apache.org/.Google ScholarGoogle Scholar
  50. The Netty Project. 2015. Netty. Retrieved from http://netty.io.Google ScholarGoogle Scholar
  51. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive -- A warehousing solution over a map-reduce framework. In Proceedings of the VLDB Endowment (PVLDB’09).Google ScholarGoogle Scholar
  52. Leslie G. Valiant. 1990. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990), 103--111. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache hadoop YARN: Yet another resource negotiator. In ACM Symposium on Cloud Computing (SoCC’13).Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, Brandon Myers, Shravan Narayanamuthy, Raghu Ramakrishnan, Sriram Rao, Russell Sear, Beysim Sezgin, and Julia Wang. 2015. REEF: Retainable evaluator execution framework. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15). Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Markus Weimer, Sriram Rao, and Martin Zinkevich. 2010. A convenient framework for efficient parallel multipass algorithms. In NIPS Workshop on Learning on Cores, Clusters and Clouds.Google ScholarGoogle Scholar
  56. Matt Welsh. 2013. What I wish systems researchers would work on. Retrieved from http://matt-welsh.blogspot.com/2013/05/what-i-wish-systems-researchers-would.html.Google ScholarGoogle Scholar
  57. Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA: An architecture for well-conditioned, scalable internet services. SIGOPS Operating Systems Review 35 (2001), 230--243. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Jerry Ye, Jyh-Herng Chow, Jiang Chen, and Zhaohui Zheng. 2009. Stochastic gradient boosted distributed decision trees. In ACM Conference on Information and Knowledge Management (CIKM’09). Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’12).Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’10).Google ScholarGoogle Scholar
  61. Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel databases meet mapreduce. VLDB Journal 21, 5 (2012), 611--636. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Apache REEF: Retainable Evaluator Execution Framework

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Computer Systems
          ACM Transactions on Computer Systems  Volume 35, Issue 2
          May 2017
          113 pages
          ISSN:0734-2071
          EISSN:1557-7333
          DOI:10.1145/3129286
          Issue’s Table of Contents

          Copyright © 2017 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 10 October 2017
          • Accepted: 1 July 2017
          • Revised: 1 February 2017
          • Received: 1 December 2015
          Published in tocs Volume 35, Issue 2

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader