ABSTRACT
In recent years, MapReduce programming model and specifically its open source implementation Hadoop has been widely used by organizations to perform large-scale data processing tasks such as web-indexing, data mining as well as scientific simulations. The key benefits of this programming model include its simple programming interface and ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of Hadoop framework assumes a centralized execution environment involving a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy in Hadoop Distributed File System (HDFS) and in the reduce phase scheduling algorithm. In this paper, we investigate real-world scenarios in which MapReduce programming model and specifically Hadoop framework could be used for processing large-scale, geographically scattered datasets. We show that using the Hadoop framework with default policies can cause severe performance degradation in such geographically distributed environment. We propose and evaluate extensions to Hadoop MapReduce framework to improve its performance in such environments. The evaluation demonstrates that the proposed extensions substantially outperform default policies in the Hadoop framework.
- L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23:22--28, March 2003. Google ScholarDigital Library
- V. Bhat, S. Klasky, S. Atchley, M. Beck, D. McCune, and M. Parashar. High performance threaded data streaming for large scale simulations. In IEEE/ACM International Workshop on Grid Computing, pages 243--250, 2004. Google ScholarDigital Library
- M. Cardosa, C. Wang, A. Nangia, A. Chandra, and J. Weissman. Exploring mapreduce efficiency with highly-distributed data. In Intl. workshop on MapReduce and its applications, MapReduce '11, pages 27--34, 2011. Google ScholarDigital Library
- B. Cho and I. Gupta. New algorithms for planning bulk transfer via internet and shipping networks. In 30th International Conference on Distributed Computing Systems (ICDCS), pages 305--314, june 2010. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51:107--113, January 2008. Google ScholarDigital Library
- Forrester Consulting. eCommerce Web Site Performance Today: An Updated Look at Consumer Reaction to a Poor Online Shopping Experience, 2009.Google Scholar
- J. F. Gantz. The Expanding Digital Universe. IDC White Paper - sponsored by EMC, 2007.Google Scholar
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In 9th ACM symposium on Operating systems principles, SOSP '03, pages 29--43, 2003. Google ScholarDigital Library
- J. Gray. A conversation with jim gray. ACM Queue, 1:8--17, 2003. Google ScholarDigital Library
- Y. Gu and R. Grossman. Sector and sphere: The design and implementation of a high performance data cloud. Theme Issue of the Philosophical Transactions of the Royal Society A, 367:2429--2445, 2009.Google ScholarCross Ref
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41:59--72, March 2007. Google ScholarDigital Library
- T. Kosar and M. Livny. Stork: Making data placement a first class citizen in the grid. In 24th International Conference on Distributed Computing Systems (ICDCS'04), pages 342--349, 2004. Google ScholarDigital Library
- H. Lin, X. Ma, J. Archuleta, W.-c. Feng, M. Gardner, and Z. Zhang. Moon: Mapreduce on opportunistic environments. In 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 95--106, 2010. Google ScholarDigital Library
- N. Tolia, M. Kaminsky, D. G. Andersen, and S. Patil. An architecture for internet data transfer. In 3rd conference on Networked Systems Design & Implementation - Volume 3, NSDI'06, pages 19--19, 2006. Google ScholarDigital Library
- A. Verma, L. Cherkasova, and R. H. Campbell. Aria: automatic resource inference and allocation for mapreduce environments. In 8th ACM international conference on Autonomic computing, pages 235--244, 2011. Google ScholarDigital Library
- D. Villegas, N. Bobroff, I. Rodero, J. Delgado, Y. Liu, A. Devarakonda, L. Fong, S. Masoud Sadjadi, and M. Parashar. Cloud federation in a layered service model. J. Comput. Syst. Sci., 78(5):1330--1344, 2012. Google ScholarDigital Library
- R. Y. Wang, S. Sobti, N. Garg, E. Ziskind, J. Lai, and A. Krishnamurthy. Turning the postal system into a generic digital communication mechanism. SIGCOMM Comput. Commun. Rev., 34:159--166, August 2004. Google ScholarDigital Library
- T. White. Hadoop - The Definitive Guide, chapter 16, Case Studies - Log processing at Rackspace, pages 531--539. O'Reilly Media, second edition, 2010.Google Scholar
- M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In 5th European conference on Computer systems, EuroSys '10, pages 265--278, 2010. Google ScholarDigital Library
- M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving mapreduce performance in heterogeneous environments. In 8th USENIX conference on Operating systems design and implementation, OSDI'08, pages 29--42, 2008. Google ScholarDigital Library
Recommendations
MapReduce: Review and open challenges
The continuous increase in computational capacity over the past years has produced an overwhelming flow of data or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. ...
Investigating MapReduce framework extensions for efficient processing of geographically scattered datasets
In this paper, we investigate real-world scenarios in which MapReduce programming model and specifically Hadoop framework could be used for processing large-scale, geographically scattered datasets. We propose an Adaptive Reduce Task Scheduling (ARTS) ...
Challenges for MapReduce in Big Data
SERVICES '14: Proceedings of the 2014 IEEE World Congress on ServicesIn the Big Data community, MapReduce has been seen as one of the key enabling approaches for meeting continuously increasing demands on computing resources imposed by massive data sets. The reason for this is the high scalability of the MapReduce ...
Comments