DOI: 10.1145/1985500.1985510
research-article

A MapReduce workflow system for architecting scientific data intensive applications

Published: 22 May 2011

Abstract

MapReduce is a promising model for developing scalable data-intensive applications in both business and science. However, few existing scientific workflow systems can benefit from the MapReduce programming model. We propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs in scientific data-intensive workflows. The system consists of a simple C++ API for workflow design, a job scheduler, and a runtime support system for the Hadoop or Sector/Sphere frameworks. A data-intensive climate satellite data processing and analysis application is developed as a use case and as an evaluation of the workflow system. The evaluation shows that the workflow system can automate the steps of the climate application, from data gridding to complex data analysis. With the MapReduce-enabled workflow system, the performance of the climate analysis application improves significantly compared with the sequential embarrassingly parallel methods, and the overhead of the workflow system is negligible. However, the graphical user interface for the workflow system is still under development.
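To make the idea of orchestrating dependent MapReduce jobs concrete, here is a minimal sketch of a workflow expressed as a dependency graph and scheduled in topological order. It is an illustration only: the abstract does not show the system's C++ API, so the `Workflow` class, its `add_job`, `schedule`, and `run` methods, and the job names `monthly_average` and `trend_analysis` are all hypothetical, and Python is used here purely for brevity. The `submit` callback stands in for whatever would hand a job to Hadoop or Sector/Sphere.

```python
from collections import deque
from typing import Callable, Dict, List, Sequence


class Workflow:
    """Hypothetical DAG of named MapReduce jobs (not the paper's C++ API)."""

    def __init__(self) -> None:
        # Maps each job name to the jobs it depends on.
        self.deps: Dict[str, List[str]] = {}

    def add_job(self, name: str, after: Sequence[str] = ()) -> None:
        # Register a job together with the jobs whose output it consumes.
        self.deps[name] = list(after)
        for dep in after:
            self.deps.setdefault(dep, [])

    def schedule(self) -> List[str]:
        # Kahn's algorithm: repeatedly emit jobs with no unmet dependencies.
        remaining = {job: len(d) for job, d in self.deps.items()}
        children: Dict[str, List[str]] = {job: [] for job in self.deps}
        for job, d in self.deps.items():
            for dep in d:
                children[dep].append(job)
        ready = deque(sorted(j for j, n in remaining.items() if n == 0))
        order: List[str] = []
        while ready:
            job = ready.popleft()
            order.append(job)
            for child in children[job]:
                remaining[child] -= 1
                if remaining[child] == 0:
                    ready.append(child)
        if len(order) != len(self.deps):
            raise ValueError("workflow contains a cycle")
        return order

    def run(self, submit: Callable[[str], None]) -> List[str]:
        # Submit each job once everything it depends on has been submitted.
        order = self.schedule()
        for job in order:
            submit(job)
        return order


# Assumed pipeline shape: gridding first, analysis steps afterwards.
wf = Workflow()
wf.add_job("gridding")
wf.add_job("monthly_average", after=["gridding"])
wf.add_job("trend_analysis", after=["monthly_average"])

submitted: List[str] = []
wf.run(submitted.append)
```

A real scheduler would also wait for each job to finish and could run independent jobs concurrently; this sketch only fixes a valid submission order.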


Published In

SECLOUD '11: Proceedings of the 2nd International Workshop on Software Engineering for Cloud Computing
May 2011
80 pages
ISBN:9781450305822
DOI:10.1145/1985500
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cloud computing
  2. mapreduce applications
  3. scheduling
  4. workflow scheduling
  5. workflow system

Qualifiers

  • Research-article

Conference

ICSE11: International Conference on Software Engineering
May 22, 2011
Waikiki, Honolulu, HI, USA
