skip to main content
10.1145/2494621.2494632acmotherconferencesArticle/Chapter ViewAbstractPublication PagescacConference Proceedingsconference-collections
research-article

A case for MapReduce over the internet

Published:09 August 2013Publication History

ABSTRACT

In recent years, MapReduce programming model and specifically its open source implementation Hadoop has been widely used by organizations to perform large-scale data processing tasks such as web-indexing, data mining as well as scientific simulations. The key benefits of this programming model include its simple programming interface and ability to process massive datasets in a scalable fashion without requiring high-end computing infrastructure. We observe that the current design of Hadoop framework assumes a centralized execution environment involving a single datacenter. This assumption leads to simplified design decisions in the Hadoop architecture regarding efficient network usage, specifically in the replica-selection policy in Hadoop Distributed File System (HDFS) and in the reduce phase scheduling algorithm. In this paper, we investigate real-world scenarios in which MapReduce programming model and specifically Hadoop framework could be used for processing large-scale, geographically scattered datasets. We show that using the Hadoop framework with default policies can cause severe performance degradation in such geographically distributed environment. We propose and evaluate extensions to Hadoop MapReduce framework to improve its performance in such environments. The evaluation demonstrates that the proposed extensions substantially outperform default policies in the Hadoop framework.

References

  1. L. A. Barroso, J. Dean, and U. Hölzle. Web search for a planet: The google cluster architecture. IEEE Micro, 23:22--28, March 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. V. Bhat, S. Klasky, S. Atchley, M. Beck, D. McCune, and M. Parashar. High performance threaded data streaming for large scale simulations. In IEEE/ACM International Workshop on Grid Computing, pages 243--250, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Cardosa, C. Wang, A. Nangia, A. Chandra, and J. Weissman. Exploring mapreduce efficiency with highly-distributed data. In Intl. workshop on MapReduce and its applications, MapReduce '11, pages 27--34, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. Cho and I. Gupta. New algorithms for planning bulk transfer via internet and shipping networks. In 30th International Conference on Distributed Computing Systems (ICDCS), pages 305--314, june 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51:107--113, January 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Forrester Consulting. eCommerce Web Site Performance Today: An Updated Look at Consumer Reaction to a Poor Online Shopping Experience, 2009.Google ScholarGoogle Scholar
  7. J. F. Gantz. The Expanding Digital Universe. IDC White Paper - sponsored by EMC, 2007.Google ScholarGoogle Scholar
  8. S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In 9th ACM symposium on Operating systems principles, SOSP '03, pages 29--43, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Gray. A conversation with jim gray. ACM Queue, 1:8--17, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Y. Gu and R. Grossman. Sector and sphere: The design and implementation of a high performance data cloud. Theme Issue of the Philosophical Transactions of the Royal Society A, 367:2429--2445, 2009.Google ScholarGoogle ScholarCross RefCross Ref
  11. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: distributed data-parallel programs from sequential building blocks. SIGOPS Oper. Syst. Rev., 41:59--72, March 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. T. Kosar and M. Livny. Stork: Making data placement a first class citizen in the grid. In 24th International Conference on Distributed Computing Systems (ICDCS'04), pages 342--349, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. H. Lin, X. Ma, J. Archuleta, W.-c. Feng, M. Gardner, and Z. Zhang. Moon: Mapreduce on opportunistic environments. In 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, pages 95--106, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. N. Tolia, M. Kaminsky, D. G. Andersen, and S. Patil. An architecture for internet data transfer. In 3rd conference on Networked Systems Design & Implementation - Volume 3, NSDI'06, pages 19--19, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Verma, L. Cherkasova, and R. H. Campbell. Aria: automatic resource inference and allocation for mapreduce environments. In 8th ACM international conference on Autonomic computing, pages 235--244, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. D. Villegas, N. Bobroff, I. Rodero, J. Delgado, Y. Liu, A. Devarakonda, L. Fong, S. Masoud Sadjadi, and M. Parashar. Cloud federation in a layered service model. J. Comput. Syst. Sci., 78(5):1330--1344, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. Y. Wang, S. Sobti, N. Garg, E. Ziskind, J. Lai, and A. Krishnamurthy. Turning the postal system into a generic digital communication mechanism. SIGCOMM Comput. Commun. Rev., 34:159--166, August 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. T. White. Hadoop - The Definitive Guide, chapter 16, Case Studies - Log processing at Rackspace, pages 531--539. O'Reilly Media, second edition, 2010.Google ScholarGoogle Scholar
  19. M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In 5th European conference on Computer systems, EuroSys '10, pages 265--278, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Zaharia, A. Konwinski, A. D. Joseph, R. Katz, and I. Stoica. Improving mapreduce performance in heterogeneous environments. In 8th USENIX conference on Operating systems design and implementation, OSDI'08, pages 29--42, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in
  • Published in

    cover image ACM Other conferences
    CAC '13: Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference
    August 2013
    247 pages
    ISBN:9781450321723
    DOI:10.1145/2494621

    Copyright © 2013 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 9 August 2013

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader