DOI:10.1145/2169090.2169092
research-article

Nobody ever got fired for using Hadoop on a cluster

Published: 10 April 2012

Abstract

The norm for data analytics is now to run it on commodity clusters with MapReduce-like abstractions. One only needs to read the popular blogs to see the evidence of this. We believe that we could now say that "nobody ever got fired for using Hadoop on a cluster"!
We completely agree that Hadoop on a cluster is the right solution for jobs where the input data is multi-terabyte or larger. However, in this position paper we ask whether this is the right path for general-purpose data analytics. Evidence suggests that many MapReduce-like jobs process relatively small input data sets (less than 14 GB). Memory has reached a GB/$ ratio such that it is now technically and financially feasible to have servers with hundreds of GB of DRAM. We therefore ask: should we be scaling by using single machines with very large memories rather than clusters? We conjecture that, in terms of hardware and programmer time, this may be a better option for the majority of data processing jobs.

Published In

HotCDP '12: Proceedings of the 1st International Workshop on Hot Topics in Cloud Data Processing
April 2012
26 pages
ISBN:9781450311625
DOI:10.1145/2169090
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Hadoop
  2. MapReduce
  3. analytics
  4. big data
  5. scalability

Qualifiers

  • Research-article

Conference

EuroSys '12: Seventh EuroSys Conference 2012
April 10, 2012
Bern, Switzerland

Cited By

  • (2021) Performance Analysis and Auto-tuning for SPARK in-memory analytics. 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 76-81. DOI:10.23919/DATE51398.2021.9474122. Online publication date: 1-Feb-2021
  • (2021) TipTap: Approximate Mining of Frequent k-Subgraph Patterns in Evolving Graphs. ACM Transactions on Knowledge Discovery from Data 15:3, pp. 1-35. DOI:10.1145/3442590. Online publication date: 21-Apr-2021
  • (2021) Magas: matrix-based asynchronous graph analytics on shared memory systems. The Journal of Supercomputing 78:4, pp. 5650-5680. DOI:10.1007/s11227-021-04091-x. Online publication date: 1-Oct-2021
  • (2020) Resource-Aware MapReduce Runtime for Multi/Many-core Architectures. 2020 Design, Automation & Test in Europe Conference & Exhibition (DATE), pp. 897-902. DOI:10.23919/DATE48585.2020.9116281. Online publication date: Mar-2020
  • (2019) On the Effectiveness of Hybrid Canopy With Hoeffding Adaptive Naive Bayes Trees. Web Services, pp. 788-802. DOI:10.4018/978-1-5225-7501-6.ch043. Online publication date: 2019
  • (2018) Flare. Proceedings of the 13th USENIX conference on Operating Systems Design and Implementation, pp. 799-815. DOI:10.5555/3291168.3291227. Online publication date: 8-Oct-2018
  • (2018) Beyond macrobenchmarks. Proceedings of the VLDB Endowment 12:4, pp. 390-403. DOI:10.14778/3297753.3297759. Online publication date: 1-Dec-2018
  • (2018) Scalable Graph Processing Frameworks. ACM Computing Surveys 51:3, pp. 1-53. DOI:10.1145/3199523. Online publication date: 12-Jun-2018
  • (2018) Investigation of Replication Factor for Performance Enhancement in the Hadoop Distributed File System. Companion of the 2018 ACM/SPEC International Conference on Performance Engineering, pp. 135-140. DOI:10.1145/3185768.3186359. Online publication date: 2-Apr-2018
  • (2018) POSUM: A Portfolio Scheduler for MapReduce Workloads. 2018 IEEE International Conference on Big Data (Big Data), pp. 351-357. DOI:10.1109/BigData.2018.8622215. Online publication date: Dec-2018
