DOI: 10.1145/2320765.2320791
research-article

Challenges and approaches for distributed workflow-driven analysis of large-scale biological data: vision paper

Published: 30 March 2012

ABSTRACT

Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called "bioKepler", that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges related to next-generation sequencing data, explains the approaches taken in bioKepler to help with analysis of such data, and presents preliminary results demonstrating these approaches.

      Published in

      EDBT-ICDT '12: Proceedings of the 2012 Joint EDBT/ICDT Workshops
      March 2012
      265 pages
      ISBN: 9781450311434
      DOI: 10.1145/2320765

      Copyright © 2012 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall acceptance rate: 7 of 10 submissions, 70%
