research-article

Nephele: efficient parallel data processing in the cloud

Authors:
Daniel Warneke

Technische Universität Berlin, Berlin, Germany

Odej Kao

Technische Universität Berlin, Berlin, Germany


MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009
Article No.: 8
Pages 1–10
https://doi.org/10.1145/1646468.1646476

Published: 16 November 2009

ABSTRACT

In recent years, cloud computing has emerged as a promising new approach for ad-hoc parallel data processing. Major cloud computing companies have started to integrate frameworks for parallel data processing into their product portfolios, making it easy for customers to access these services and to deploy their programs. However, the processing frameworks currently in use stem from the field of cluster computing and disregard the particular nature of a cloud. As a result, the allocated compute resources may be inadequate for large parts of the submitted job and unnecessarily increase processing time and cost. In this paper, we discuss the opportunities and challenges for efficient parallel data processing in clouds and present our ongoing research project Nephele. Nephele is the first data processing framework to explicitly exploit the dynamic resource allocation offered by today's compute clouds for both task scheduling and execution. It allows the individual tasks of a processing job to be assigned to different types of virtual machines and takes care of their instantiation and termination during job execution. Based on this new framework, we perform evaluations on a compute cloud system and compare the results to the existing data processing framework Hadoop.
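The cost argument in the abstract can be made concrete with a small sketch. It is not Nephele's API or scheduler: the `Stage` class, the VM type names, the prices, and the assumption that stages run one after another are all hypothetical and chosen only for illustration. The sketch compares a cluster-style static allocation, where one set of machines sized for the most demanding stage is rented for the whole job, with a per-stage allocation in which virtual machines are started and stopped with each stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    vm_type: str   # hypothetical VM type this stage needs
    vm_count: int  # number of VMs the stage uses in parallel
    hours: float   # run time of the stage


# Hypothetical hourly prices per VM type (illustrative numbers only).
PRICE_PER_HOUR = {"small": 0.10, "large": 0.40}


def static_cost(stages):
    """Rent the most expensive VM type, at the peak VM count, for the whole job."""
    vm_type = max((s.vm_type for s in stages), key=PRICE_PER_HOUR.__getitem__)
    vm_count = max(s.vm_count for s in stages)
    total_hours = sum(s.hours for s in stages)  # assumes stages run sequentially
    return PRICE_PER_HOUR[vm_type] * vm_count * total_hours


def dynamic_cost(stages):
    """Pay for each stage's VMs only while that stage is running."""
    return sum(PRICE_PER_HOUR[s.vm_type] * s.vm_count * s.hours for s in stages)


# A made-up three-stage job with very different resource needs per stage.
job = [
    Stage("extract", "small", 12, 1.0),
    Stage("aggregate", "large", 2, 0.5),
    Stage("report", "small", 1, 0.25),
]

if __name__ == "__main__":
    print(f"static allocation:  {static_cost(job):.3f}")
    print(f"dynamic allocation: {dynamic_cost(job):.3f}")
```

Under these invented numbers the static allocation pays for twelve large machines during stages that need far fewer or smaller ones, which is the kind of mismatch the abstract describes. The model ignores VM startup time and billing granularity, both of which would narrow the gap in practice.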

Index Terms

Nephele: efficient parallel data processing in the cloud
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        1. Distributed programming languages

Published in
MTAGS '09: Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers
November 2009
131 pages
ISBN:9781605587141
DOI:10.1145/1646468
Conference Chairs:
Ioan Raicu (Northwestern University),
Ian Foster (University of Chicago & Argonne National Laboratory),
Yong Zhao (Microsoft)
Copyright © 2009 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cloud computing
high-throughput computing
loosely coupled applications
many-task computing
Qualifiers
- research-article