Apache Hadoop YARN: yet another resource negotiator

Authors:
Vinod Kumar Vavilapalli (hortonworks.com)
Arun C. Murthy (hortonworks.com)
Chris Douglas (microsoft.com)
Sharad Agarwal (inmobi.com)
Mahadev Konar (hortonworks.com)
Robert Evans (yahoo-inc.com)
Thomas Graves (yahoo-inc.com)
Jason Lowe (yahoo-inc.com)
Hitesh Shah (hortonworks.com)
Siddharth Seth (hortonworks.com)
Bikas Saha (hortonworks.com)
Carlo Curino (microsoft.com)
Owen O'Malley (hortonworks.com)
Sanjay Radia (hortonworks.com)
Benjamin Reed (facebook.com)
Eric Baldeschwieler (hortonworks.com)

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing, October 2013, Article No. 5, Pages 1–16. https://doi.org/10.1145/2523616.2523633

Published: 01 October 2013

ABSTRACT

The initial design of Apache Hadoop [1] was tightly focused on running massive MapReduce jobs to process a web crawl. For increasingly diverse companies, Hadoop has become the data and computational agorá: the de facto place where data and computational resources are shared and accessed. This broad adoption and ubiquitous usage have stretched the initial design well beyond its intended target, exposing two key shortcomings: 1) tight coupling of a specific programming model with the resource management infrastructure, forcing developers to abuse the MapReduce programming model, and 2) centralized handling of jobs' control flow, which resulted in endless scalability concerns for the scheduler.

In this paper, we summarize the design, development, and current state of deployment of the next generation of Hadoop's compute platform: YARN. The new architecture we introduced decouples the programming model from the resource management infrastructure, and delegates many scheduling functions (e.g., task fault-tolerance) to per-application components. We provide experimental evidence demonstrating the improvements we made, confirm improved efficiency by reporting the experience of running YARN in production environments (including 100% of Yahoo! grids), and confirm the flexibility claims by discussing the porting of several programming frameworks onto YARN, viz. Dryad, Giraph, Hoya, Hadoop MapReduce, REEF, Spark, Storm, and Tez.
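
The separation described above can be illustrated with a toy model. The sketch below is not YARN's API: the class names `ResourceManager` and `ApplicationMaster`, the slot-counting allocator, and the `flaky` task function are all assumptions made for illustration. It only shows the division of labor the abstract names, with a cluster-wide component that hands out capacity and a per-application component that owns task retries.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Hypothetical cluster-wide component: only tracks and grants container slots."""
    free_containers: int

    def allocate(self, requested: int) -> int:
        # Grant as many slots as are free, up to the requested amount.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        return granted

    def release(self, count: int) -> None:
        self.free_containers += count


@dataclass
class ApplicationMaster:
    """Hypothetical per-application component: owns task scheduling and retries."""
    tasks: list
    max_attempts: int = 3
    attempts: dict = field(default_factory=dict)

    def run(self, rm, run_task):
        pending, done, failed = list(self.tasks), [], []
        while pending:
            granted = rm.allocate(len(pending))
            if granted == 0:
                raise RuntimeError("cluster has no free containers")
            batch, pending = pending[:granted], pending[granted:]
            for task in batch:
                self.attempts[task] = self.attempts.get(task, 0) + 1
                if run_task(task, self.attempts[task]):
                    done.append(task)
                elif self.attempts[task] < self.max_attempts:
                    pending.append(task)  # retry is decided here, not centrally
                else:
                    failed.append(task)
            rm.release(granted)
        return done, failed


def flaky(task, attempt):
    # Assumed workload: task "b" fails on its first attempt only.
    return not (task == "b" and attempt == 1)


rm = ResourceManager(free_containers=2)
am = ApplicationMaster(tasks=["a", "b", "c"])
done, failed = am.run(rm, flaky)
```

In this toy, the resource manager never learns what a task is or whether it failed; a different programming model would supply its own application master against the same allocator.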

References

[1] Apache Hadoop. http://hadoop.apache.org.
[2] Apache Tez. http://incubator.apache.org/projects/tez.html.
[3] Netty project. http://netty.io.
[4] Storm. http://storm-project.net/.
[5] H. Ballani, P. Costa, T. Karagiannis, and A. I. Rowstron. Towards predictable datacenter networks. In SIGCOMM, volume 11, pages 242–253, 2011.
[6] F. P. Brooks, Jr. The mythical man-month (anniversary ed.). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1995.
[7] N. Capit, G. Da Costa, Y. Georgiou, G. Huard, C. Martin, G. Mounie, P. Neyron, and O. Richard. A batch scheduler with high level components. In Cluster Computing and the Grid, 2005. CCGrid 2005. IEEE International Symposium on, volume 2, pages 776–783, 2005.
[8] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2): 1265–1276, Aug. 2008.
[9] M. Chowdhury, M. Zaharia, J. Ma, M. I. Jordan, and I. Stoica. Managing data transfers in computer clusters with orchestra. SIGCOMM-Computer Communication Review, 41(4): 98, 2011.
[10] B.-G. Chun, T. Condie, C. Curino, R. Ramakrishnan, R. Sears, and M. Weimer. Reef: Retainable evaluator execution framework. In VLDB 2013, Demo, 2013.
[11] B. F. Cooper, E. Baldeschwieler, R. Fonseca, J. J. Kistler, P. Narayan, C. Neerdaels, T. Negrin, R. Ramakrishnan, A. Silberstein, U. Srivastava, et al. Building a cloud for Yahoo! IEEE Data Eng. Bull., 32(1): 36–43, 2009.
[12] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1): 107–113, Jan. 2008.
[13] W. Emeneker, D. Jackson, J. Butikofer, and D. Stanzione. Dynamic virtual clustering with xen and moab. In G. Min, B. Martino, L. Yang, M. Guo, and G. Rünger, editors, Frontiers of High Performance Computing and Networking, ISPA 2006 Workshops, volume 4331 of Lecture Notes in Computer Science, pages 440–451. Springer Berlin Heidelberg, 2006.
[14] Facebook Engineering Team. Under the Hood: Scheduling MapReduce jobs more efficiently with Corona. http://on.fb.me/TxUsYN, 2012.
[15] D. Gottfrid. Self-service prorated super-computing fun. http://open.blogs.nytimes.com/2007/11/01/self-service-prorated-super-computing-fun, 2007.
[16] T. Graves. GraySort and MinuteSort at Yahoo on Hadoop 0.23. http://sortbenchmark.org/Yahoo2013Sort.pdf, 2013.
[17] B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: a platform for fine-grained resource sharing in the data center. In Proceedings of the 8th USENIX conference on Networked systems design and implementation, NSDI'11, pages 22–22, Berkeley, CA, USA, 2011. USENIX Association.
[18] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, pages 59–72, New York, NY, USA, 2007. ACM.
[19] M. Islam, A. K. Huang, M. Battisha, M. Chiang, S. Srinivasan, C. Peters, A. Neumann, and A. Abdelnur. Oozie: towards a scalable workflow management system for hadoop. In Proceedings of the 1st ACM SIGMOD Workshop on Scalable Workflow Execution Engines and Technologies, page 4. ACM, 2012.
[20] D. B. Jackson, Q. Snell, and M. J. Clement. Core algorithms of the maui scheduler. In Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP '01, pages 87–102, London, UK, 2001. Springer-Verlag.
[21] S. Loughran, D. Das, and E. Baldeschwieler. Introducing Hoya - HBase on YARN. http://hortonworks.com/blog/introducing-hoya-hbase-on-yarn/, 2013.
[22] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: a system for large-scale graph processing. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, SIGMOD '10, pages 135–146, New York, NY, USA, 2010. ACM.
[23] R. O. Nambiar and M. Poess. The making of tpc-ds. In Proceedings of the 32nd international conference on Very large data bases, VLDB '06, pages 1049–1058. VLDB Endowment, 2006.
[24] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, SIGMOD '08, pages 1099–1110, New York, NY, USA, 2008. ACM.
[25] O. O'Malley. Hadoop: The Definitive Guide, chapter Hadoop at Yahoo!, pages 11–12. O'Reilly Media, 2012.
[26] M. Schwarzkopf, A. Konwinski, M. Abd-El-Malek, and J. Wilkes. Omega: flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, pages 351–364, New York, NY, USA, 2013. ACM.
[27] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop Distributed File System. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST '10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.
[28] T.-W. N. Sze. The two quadrillionth bit of π is 0! http://developer.yahoo.com/blogs/hadoop/two-quadrillionth-bit-0-467.html.
[29] D. Thain, T. Tannenbaum, and M. Livny. Distributed computing in practice: the Condor experience. Concurrency and Computation: Practice and Experience, 17(2–4): 323–356, 2005.
[30] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using Hadoop. In F. Li, M. M. Moro, S. Ghandeharizadeh, J. R. Haritsa, G. Weikum, M. J. Carey, F. Casati, E. Y. Chang, I. Manolescu, S. Mehrotra, U. Dayal, and V. J. Tsotras, editors, Proceedings of the 26th International Conference on Data Engineering, ICDE 2010, March 1–6, 2010, Long Beach, California, USA, pages 996–1005. IEEE, 2010.
[31] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: a system for general-purpose distributed data-parallel computing using a high-level language. In Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI'08, pages 1–14, Berkeley, CA, USA, 2008. USENIX Association.
[32] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, HotCloud'10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association.

Published in
SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
October 2013
427 pages
ISBN: 9781450324281
DOI: 10.1145/2523616
General Chair: Guy Lohman
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
SOCC '13 Paper Acceptance Rate: 23 of 114 submissions, 20%. Overall Acceptance Rate: 169 of 722 submissions, 23%.