ABSTRACT
MapReduce is a promising model for developing scalable, data-intensive applications in both business and science. However, few existing scientific workflow systems can take advantage of the MapReduce programming model. We propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs in data-intensive scientific workflows. The system consists of a simple C++ API for workflow design, a job scheduler, and a runtime support system for the Hadoop or Sector/Sphere frameworks. A data-intensive application for processing and analyzing climate satellite data is developed as a use case and as an evaluation of the workflow system. The evaluation shows that the workflow system can automate the steps of the climate application, from data gridding to complex data analysis. The MapReduce-enabled workflow system significantly improves the performance of the climate analysis application compared with the sequential, embarrassingly parallel methods. The overhead of the workflow system is negligible. However, the graphical user interface for the workflow system is still under development.
Index Terms
- A MapReduce workflow system for architecting scientific data intensive applications