research-article

Open Access

Apache REEF: Retainable Evaluator Execution Framework

Authors:
Byung-Gon Chun

Seoul National University, Seoul, South Korea

Seoul National University, Seoul, South Korea
View Profile

,
Tyson Condie

University of California at Los Angeles, Los Angeles, CA, USA

University of California at Los Angeles, Los Angeles, CA, USA
View Profile

,
Yingda Chen

Alibaba, Seattle, WA, USA

Alibaba, Seattle, WA, USA
View Profile

,
Brian Cho

Facebook, Menlo Park, CA, USA

Facebook, Menlo Park, CA, USA
View Profile

,
Andrew Chung

Carnegie Mellon University, Pittsburgh, PA, USA

Carnegie Mellon University, Pittsburgh, PA, USA
View Profile

,
Carlo Curino

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Chris Douglas

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Matteo Interlandi

University of California at Los Angeles, Los Angeles, CA, USA

University of California at Los Angeles, Los Angeles, CA, USA
View Profile

,
Beomyeol Jeon

University of Illinois at Urbana-Champaign, Urbana, IL, USA

University of Illinois at Urbana-Champaign, Urbana, IL, USA
View Profile

,
Joo Seong Jeong

Seoul National University, Seoul, South Korea

Seoul National University, Seoul, South Korea
View Profile

,
Gyewon Lee

Seoul National University, Seoul, South Korea

Seoul National University, Seoul, South Korea
View Profile

,
Yunseong Lee

Seoul National University, Seoul, South Korea

Seoul National University, Seoul, South Korea
View Profile

,
Tony Majestro

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Dahlia Malkhi

VMware, Palo Alto, CA, USA

VMware, Palo Alto, CA, USA
View Profile

,
Sergiy Matusevych

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Brandon Myers

University of Iowa, Iowa City, IA, USA

University of Iowa, Iowa City, IA, USA
View Profile

,
Mariia Mykhailova

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Shravan Narayanamurthy

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Joseph Noor

University of California at Los Angeles, Los Angeles, CA, USA

University of California at Los Angeles, Los Angeles, CA, USA
View Profile

,
Raghu Ramakrishnan

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Sriram Rao

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Russell Sears

Pure Storage, Mountain View, CA, USA

Pure Storage, Mountain View, CA, USA
View Profile

,
Beysim Sezgin

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Taegeon Um

Seoul National University, Seoul, South Korea

Seoul National University, Seoul, South Korea
View Profile

,
Julia Wang

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Markus Weimer

Microsoft, Redmond, WA, USA

Microsoft, Redmond, WA, USA
View Profile

,
Youngseok Yang

Seoul National University, Seoul, South Korea

Seoul National University, Seoul, South Korea
View Profile

Authors Info & Claims

ACM Transactions on Computer Systems Volume 35 Issue 2Article No.: 5pp 1–31https://doi.org/10.1145/3132037

Published:10 October 2017Publication History

ACM Transactions on Computer Systems

Abstract

Resource Managers like YARN and Mesos have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low level. This flexibility comes at a high cost in terms of developer effort, as each application must repeatedly tackle the same challenges (e.g., fault tolerance, task scheduling and coordination) and reimplement common mechanisms (e.g., caching, bulk-data transfers). This article presents REEF, a development framework that provides a control plane for scheduling and coordinating task-level (data-plane) work on cluster resources obtained from a Resource Manager. REEF provides mechanisms that facilitate resource reuse for data caching and state management abstractions that greatly ease the development of elastic data processing pipelines on cloud platforms that support a Resource Manager service. We illustrate the power of REEF by showing applications built atop: a distributed shell application, a machine-learning framework, a distributed in-memory caching system, and a port of the CORFU system. REEF is currently an Apache top-level project that has attracted contributors from several institutions and it is being used to develop several commercial offerings such as the Azure Stream Analytics service.

References

Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. 2011. A reliable effective terascale linear learning system. CoRR abs/1110.4198 (2011).Google Scholar
A. Ahmed, M. Aly, J. Gonzalez, S. Narayanamurthy, and A. J. Smola. 2012. Scalable inference in latent variable models. In ACM International Conference on Web Search and Data Mining (WSDM’12). Google ScholarDigital Library
Peter Alvaro, Neil Conway, Joe Hellerstein, and William R. Marczak. 2011. Consistency analysis in bloom: A CALM and collected approach. In Conference on Innovative Data Systems Research (CIDR’11).Google Scholar
Colin McCabe and Andrew Wang. 2013. Centralized cache management in HDFS. https://issues.apache.org/jira/browse/HDFS-4949.Google Scholar
The Kubernetes Authors. 2015. Kubernetes. Retrieved from https://kubernetes.io/.Google Scholar
Mahesh Balakrishnan, Dahlia Malkhi, John D. Davis, Vijayan Prabhakaran, Michael Wei, and Ted Wobber. 2013. CORFU: A distributed shared log. ACM Transactions on Computer Systems (TOCS) 31, 4 (2013), 10.Google ScholarDigital Library
Dominic Battré, Stephan Ewen, Fabian Hueske, Odej Kao, Volker Markl, and Daniel Warneke. 2010. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In ACM Symposium on Cloud Computing (SoCC’10).Google ScholarDigital Library
Alex Beutel, Markus Weimer, Vijay Narayanan, Yordan Zaykov, and Tom Minka. 2014. Elastic distributed bayesian collaborative filtering. In NIPS Workshop on Distributed Machine Learning and Matrix Computations.Google Scholar
Vinayak Borkar, Yingyi Bu, Michael J. Carey, Joshua Rosen, Neoklis Polyzotis, Tyson Condie, Markus Weimer, and Raghu Ramakrishnan. 2012. Declarative systems for large-scale machine learning. IEEE Technical Committee on Data Engineering (TCDE) 35, 2 (2012), 24--32.Google Scholar
Vinayak Borkar, Michael Carey, Raman Grover, Nicola Onose, and Rares Vernica. 2011. Hyracks: A flexible and extensible foundation for data-intensive computing. In International Conference on Data Engineering (ICDE’11). Google ScholarDigital Library
Olivier Bousquet and Léon Bottou. 2007. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (NIPS’07).Google Scholar
Trishul Chilimbi, Yutaka Suzue, Johnson Apacible, and Karthik Kalyanaraman. 2014. Project adam: Building an efficient and scalable deep learning training system. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). 571--582.Google Scholar
Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. 2006. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems (NIPS’06).Google Scholar
Jeffrey Dean, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc’Aurelio Ranzato, Andrew Senior, Paul Tucker, Ke Yang, and Andrew Y. Ng. 2012. Large scale distributed deep networks. In Advances in Neural Information Processing Systems. 1223--1231.Google Scholar
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Communications of the ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
Jaliya Ekanayake, Hui Li, Bingjing Zhang, Thilina Gunarathne, Seung-Hee Bae, Judy Qiu, and Geoffrey Fox. 2010. Twister: A runtime for iterative mapreduce. In ACM International Symposium on High Performance Distributed Computing (HPDC’10). Google ScholarDigital Library
Google. 2015. Guice. Retrieved from https://github.com/google/guice.Google Scholar
William Gropp, Steven Huss-Lederman, Andrew Lumsdaine, Ewing Lusk, Bill Nitzberg, William Saphir, and Marc Snir. 1998. MPI - The Complete Reference: Volume 2, The MPI-2 Extensions. MIT Press.Google Scholar
Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy Katz, Scott Shenker, and Ion Stoica. 2011. Mesos: A platform for fine-grained resource sharing in the data center. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’11).Google Scholar
Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: Distributed data-parallel programs from sequential building blocks. In ACM European Conference on Computer Systems (EuroSys’07).Google ScholarDigital Library
Soila Kavulya, Jiaqi Tan, Rajeev Gandhi, and Priya Narasimhan. 2010. An analysis of traces from a production mapreduce cluster. In IEEE/ACM International Conference on Cluster, Cloud and Grid Computing (CCGrid’10). Google ScholarDigital Library
Michael Kearns. 1998. Efficient noise-tolerant learning from statistical queries. Journal of the ACM 45, 6 (1998), 983--1006. Google ScholarDigital Library
Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The click modular router. ACM Transactions on Computer Systems (TOCS) 18, 3 (2000), 263--297. Google ScholarDigital Library
J. Kreps, N. Narkhede, and J. Rao. 2011. Kafka: A distributed messaging system for log processing. In International Workshop on Networking Meets Databases (NetDB’11).Google Scholar
Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer, and Vijay Narayanan. 2013. Distributed and scalable PCA in the cloud. In BigLearn NIPS Workshop.Google Scholar
Haoyuan Li, Ali Ghodsi, Matei Zaharia, Scott Shenker, and Ion Stoica. 2014. Tachyon: Reliable, memory speed storage for cluster computing frameworks. In ACM Symposium on Cloud Computing (SoCC’14).Google ScholarDigital Library
Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling distributed machine learning with the parameter server. In USENIX Symposium on Operating Systems Design and Implementation (OSDI’14). Google ScholarDigital Library
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence (UAI’10).Google Scholar
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: A system for large-scale graph processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’10). Google ScholarDigital Library
Nathan Marz. 2015. Storm: Distributed and Fault-Tolerant Realtime Computation. http://storm.apache.org.Google Scholar
Erik Meijer. 2012. Your mouse is a database. Communications of the ACM 55, 5 (2012), 66--73. Google ScholarDigital Library
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A timely dataflow system. In ACM Symposium on Operating Systems Principles (SOSP’13).Google ScholarDigital Library
Shravan Narayanamurthy, Markus Weimer, Dhruv Mahajan, Tyson Condie, Sundararajan Sellamanickam, and S. Sathiya Keerthi. 2013. Towards resource-elastic machine learning. In BigLearn NIPS Workshop.Google Scholar
Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed stream computing platform. In IEEE International Conference on Data Mining Workshops (ICDMW’10).Google ScholarDigital Library
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig latin: A not-so-foreign language for data processing. In ACM SIGMOD International Conference on Management of Data (SIGMOD’08). Google ScholarDigital Library
Ariel Rabkin. 2012. Using Program Analysis to Reduce Misconfiguration in Open Source Systems Software. Ph.D. Dissertation, UC Berkeley (2012).Google Scholar
David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1985. Learning Internal Representations by Error Propagation. Technical Report. DTIC Document. Google ScholarCross Ref
Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun Murthy, and Carlo Curino. 2015. Apache tez: A unifying framework for modeling and building data processing applications. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15). Google ScholarDigital Library
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. 2013. Omega: Flexible, scalable schedulers for large compute clusters. In ACM European Conference on Computer Systems (EuroSys’13).Google ScholarDigital Library
Marc Shapiro and Nuno M. Preguiça. 2007. Designing a commutative replicated data type. CoRR abs/0710.1784.Google Scholar
Alexander Shkapsky, Mohan Yang, Matteo Interlandi, Hsuan Chiu, Tyson Condie, and Carlo Zaniolo. 2016. Big data analytics with datalog queries on spark. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16). ACM, New York,1135--1149. DOI:http://dx.doi.org/10.1145/2882903.2915229 Google ScholarDigital Library
Alexander Smola and Shravan Narayanamurthy. 2010. An architecture for parallel topic models. Proceedings of the VLDB Endowment 3, 1--2 (2010), 703--710.Google ScholarDigital Library
Michael Stonebraker and Ugur Cetintemel. 2005. One size fits all: An idea whose time has come and gone. In International Conference on Data Engineering (ICDE’05).Google ScholarDigital Library
The Apache Software Foundation. 2017. Apache Accumulo. Retrieved from http://accumulo.apache.org/.Google Scholar
The Apache Software Foundation. 2017. Apache Giraph. Retrieved from http://giraph.apache.org/.Google Scholar
The Apache Software Foundation. 2017. Apache Hadoop. Retrieved from http://hadoop.apache.org.Google Scholar
The Apache Software Foundation. 2017. Apache Mahout. Retrieved from http://mahout.apache.org.Google Scholar
The Apache Software Foundation. 2017. Apache Slider. Retrieved from http://slider.incubator.apache.org/.Google Scholar
The Apache Software Foundation. 2017. Apache Twill. Retrieved from http://twill.apache.org/.Google Scholar
The Netty Project. 2015. Netty. Retrieved from http://netty.io.Google Scholar
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive -- A warehousing solution over a map-reduce framework. In Proceedings of the VLDB Endowment (PVLDB’09).Google Scholar
Leslie G. Valiant. 1990. A bridging model for parallel computation. Communications of the ACM 33, 8 (1990), 103--111. Google ScholarDigital Library
Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O’Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. 2013. Apache hadoop YARN: Yet another resource negotiator. In ACM Symposium on Cloud Computing (SoCC’13).Google ScholarDigital Library
Markus Weimer, Yingda Chen, Byung-Gon Chun, Tyson Condie, Carlo Curino, Chris Douglas, Yunseong Lee, Tony Majestro, Dahlia Malkhi, Sergiy Matusevych, Brandon Myers, Shravan Narayanamuthy, Raghu Ramakrishnan, Sriram Rao, Russell Sear, Beysim Sezgin, and Julia Wang. 2015. REEF: Retainable evaluator execution framework. In ACM SIGMOD International Conference on Management of Data (SIGMOD’15). Google ScholarDigital Library
Markus Weimer, Sriram Rao, and Martin Zinkevich. 2010. A convenient framework for efficient parallel multipass algorithms. In NIPS Workshop on Learning on Cores, Clusters and Clouds.Google Scholar
Matt Welsh. 2013. What I wish systems researchers would work on. Retrieved from http://matt-welsh.blogspot.com/2013/05/what-i-wish-systems-researchers-would.html.Google Scholar
Matt Welsh, David Culler, and Eric Brewer. 2001. SEDA: An architecture for well-conditioned, scalable internet services. SIGOPS Operating Systems Review 35 (2001), 230--243. Google ScholarDigital Library
Jerry Ye, Jyh-Herng Chow, Jiang Chen, and Zhaohui Zheng. 2009. Stochastic gradient boosted distributed decision trees. In ACM Conference on Information and Knowledge Management (CIKM’09). Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In USENIX Symposium on Networked Systems Design and Implementation (NSDI’12).Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud’10).Google Scholar
Jingren Zhou, Nicolas Bruno, Ming-Chuan Wu, Per-Ake Larson, Ronnie Chaiken, and Darren Shakib. 2012. SCOPE: Parallel databases meet mapreduce. VLDB Journal 21, 5 (2012), 611--636. Google ScholarDigital Library

Index Terms

Apache REEF: Retainable Evaluator Execution Framework

Recommendations

REEF: Retainable Evaluator Execution Framework
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Resource Managers like Apache YARN have emerged as a critical layer in the cloud computing system stack, but the developer abstractions for leasing cluster resources and instantiating application logic are very low-level. This flexibility comes at a ...
Read More
Learning Apache Spark 2.0
Read More
Apache Accumulo for Developers
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Computer Systems Volume 35, Issue 2
May 2017
113 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/3129286
Editor:
Todd C. Mowry
Carnegie Mellon University, Pittsburgh, PA
Issue’s Table of Contents
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 10 October 2017
- Accepted: 1 July 2017
- Revised: 1 February 2017
- Received: 1 December 2015
Published in tocs Volume 35, Issue 2

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Data processing
resource management
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 6
  Total Citations
  View Citations
- 1,005
  Total Downloads
- Downloads (Last 12 months)78
- Downloads (Last 6 weeks)14
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Apache REEF: Retainable Evaluator Execution Framework

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

REEF: Retainable Evaluator Execution Framework

Learning Apache Spark 2.0

Apache Accumulo for Developers

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Apache REEF: Retainable Evaluator Execution Framework

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

REEF: Retainable Evaluator Execution Framework

Learning Apache Spark 2.0

Apache Accumulo for Developers

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media