DOI: 10.1145/1985500.1985510
research-article

A MapReduce workflow system for architecting scientific data intensive applications

Published: 22 May 2011

ABSTRACT

MapReduce is promising for developing scalable data-intensive applications in both business and science. However, few existing scientific workflow systems can benefit from the MapReduce programming model. We propose a workflow system for integrating, structuring, and orchestrating MapReduce jobs in scientific data-intensive workflows. The system consists of a simple C++ API for workflow design, a job scheduler, and a runtime support system for the Hadoop or Sector/Sphere frameworks. A data-intensive climate satellite processing and analysis application is developed as a use case and as an evaluation of the workflow system. The evaluation shows that the workflow system can automate the steps of the climate application, from data gridding to complex data analysis. The MapReduce workflow system significantly improves the performance of the climate analysis application compared with sequential embarrassingly parallel methods, and the overhead of the workflow system is negligible. However, the graphical user interface for the workflow system is still under development.
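
As a rough illustration of the idea in the abstract (MapReduce jobs chained into a workflow and run in dependency order by a scheduler), the sketch below may help. It is not the system's API: the paper describes a C++ API targeting Hadoop or Sector/Sphere, whereas this is a small in-process Python sketch. The names `Workflow`, `add_job`, `schedule`, `map_reduce`, `grid`, and `zonal_mean`, as well as the sample observations, are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict, deque


def map_reduce(records, mapper, reducer):
    """Run one in-process MapReduce job: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


class Workflow:
    """Hypothetical workflow: named jobs plus dependencies forming a DAG."""

    def __init__(self):
        self.jobs = {}  # job name -> callable taking a dict of upstream results
        self.deps = {}  # job name -> list of upstream job names

    def add_job(self, name, func, deps=()):
        self.jobs[name] = func
        self.deps[name] = list(deps)

    def schedule(self):
        """Return job names in dependency order (Kahn's topological sort)."""
        pending = {name: len(d) for name, d in self.deps.items()}
        children = defaultdict(list)
        for name, d in self.deps.items():
            for parent in d:
                children[parent].append(name)
        ready = deque(n for n, count in pending.items() if count == 0)
        order = []
        while ready:
            name = ready.popleft()
            order.append(name)
            for child in children[name]:
                pending[child] -= 1
                if pending[child] == 0:
                    ready.append(child)
        if len(order) != len(self.jobs):
            raise ValueError("workflow contains a cycle or an unknown dependency")
        return order

    def run(self):
        """Run every job once, passing each job the results of its upstream jobs."""
        results = {}
        for name in self.schedule():
            upstream = {d: results[d] for d in self.deps[name]}
            results[name] = self.jobs[name](upstream)
        return results


# Made-up observations: (latitude, longitude, brightness temperature in K).
observations = [(10.2, 20.7, 280.0), (10.8, 20.1, 284.0), (11.5, 20.3, 270.0)]


def grid(_upstream):
    # Gridding job: average all observations that fall in the same 1-degree cell.
    return map_reduce(
        observations,
        mapper=lambda obs: [((math.floor(obs[0]), math.floor(obs[1])), obs[2])],
        reducer=lambda _cell, values: sum(values) / len(values),
    )


def zonal_mean(upstream):
    # Analysis job: average the gridded cells within each latitude band.
    return map_reduce(
        upstream["grid"].items(),
        mapper=lambda item: [(item[0][0], item[1])],
        reducer=lambda _lat, values: sum(values) / len(values),
    )


wf = Workflow()
wf.add_job("zonal_mean", zonal_mean, deps=["grid"])  # declared first, runs second
wf.add_job("grid", grid)
results = wf.run()
```

Jobs can be declared in any order; the scheduler derives the execution order from the declared dependencies, so the gridding job runs before the analysis job that consumes its output.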


      • Published in

        SECLOUD '11: Proceedings of the 2nd International Workshop on Software Engineering for Cloud Computing
        May 2011
        80 pages
        ISBN:9781450305822
        DOI:10.1145/1985500

        Copyright © 2011 ACM


        Publisher

        Association for Computing Machinery

        New York, NY, United States
