DOI: 10.1145/1646468.1646471

Lessons learned from a year's worth of benchmarks of large data clouds

Published: 16 November 2009

Abstract

In this paper, we discuss some of the lessons we have learned from working with the Hadoop and Sector/Sphere systems. Both are cloud-based systems designed to support data-intensive computing, and both include distributed file systems and closely coupled systems for processing data in parallel. Hadoop uses MapReduce, while Sphere can execute an arbitrary user-defined function over the data managed by Sector. We compare and contrast these systems and discuss some of the design trade-offs necessary in data-intensive computing. In our experimental studies over the past year, Sector/Sphere has consistently performed about 2 to 4 times faster than Hadoop. We discuss some of the reasons that might be responsible for this difference in performance.
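
The contrast between the two processing models named in the abstract can be illustrated with a minimal, single-process sketch. It is not the Hadoop or Sphere API: `map_reduce`, `apply_udf`, and the sample data are hypothetical names chosen for illustration, and the sketch ignores distribution, data locality, and fault tolerance entirely. It only shows that MapReduce fixes the processing shape (map, group by key, reduce), whereas a user-defined function is applied directly to each data segment.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """MapReduce-style processing: map each record to (key, value) pairs,
    group the values by key, then reduce each group to a single result."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

def apply_udf(segments, udf):
    """UDF-style processing: apply one arbitrary function to each data
    segment independently and collect the per-segment results."""
    return [udf(segment) for segment in segments]

# Hypothetical data: two segments, each holding a few text lines.
segments = [["a b", "b c"], ["c c"]]

def word_mapper(line):
    # Emit (word, 1) for every word on the line.
    return [(word, 1) for word in line.split()]

def sum_reducer(key, values):
    # Total the counts emitted for one word.
    return sum(values)

def count_words(segment):
    # Any function of a whole segment is allowed; here, a word count.
    return sum(len(line.split()) for line in segment)

all_lines = [line for segment in segments for line in segment]
word_counts = map_reduce(all_lines, word_mapper, sum_reducer)
segment_sizes = apply_udf(segments, count_words)
```

In this sketch a MapReduce job is itself expressible as functions applied to segments, which is one way to read the abstract's description of the user-defined function model as the more general of the two.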

Published In

MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009
131 pages
ISBN:9781605587141
DOI:10.1145/1646468

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. MapReduce
  2. cloud computing
  3. data intensive computing
  4. grid computing
  5. high performance computing
  6. multi-task computing

Qualifiers

  • Research-article

Conference

SC '09