
A case for MapReduce over the internet

Authors:
Hrishikesh Gadre, Rutgers University, Piscataway, New Jersey
Ivan Rodero, Rutgers University, Piscataway, New Jersey
Javier Diaz-Montes, Rutgers University, Piscataway, New Jersey
Manish Parashar, Rutgers University, Piscataway, New Jersey

CAC '13: Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference, August 2013, Article No.: 8, Pages 1–10, https://doi.org/10.1145/2494621.2494632

Published: 09 August 2013

ABSTRACT

In recent years, the MapReduce programming model, and specifically its open-source implementation Hadoop, has been widely used by organizations to perform large-scale data processing tasks such as web indexing, data mining, and scientific simulations. The key benefits of this programming model include its simple programming interface and its ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of the Hadoop framework assumes a centralized execution environment involving a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy of the Hadoop Distributed File System (HDFS) and in the reduce-phase scheduling algorithm. In this paper, we investigate real-world scenarios in which the MapReduce programming model, and specifically the Hadoop framework, could be used for processing large-scale, geographically scattered datasets. We show that using the Hadoop framework with its default policies can cause severe performance degradation in such geographically distributed environments. We propose and evaluate extensions to the Hadoop MapReduce framework to improve its performance in such environments. The evaluation demonstrates that the proposed extensions substantially outperform the default policies in the Hadoop framework.


Published in
CAC '13: Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
August 2013
247 pages
ISBN:9781450321723
DOI:10.1145/2494621
General Chair: Salim Hariri, University of Arizona
Program Chair: Alan Sill, Texas Tech University
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
MapReduce
data center
data processing
distributed processing
Qualifiers
- research-article
