ABSTRACT
Next-generation DNA sequencing machines are generating very large amounts of sequence data with applications to many scientific challenges, placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities for rapid analysis of next-generation sequence data. Motivated by this and by our previous experience in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called "bioKepler", that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges posed by next-generation sequencing data, explains the approaches taken in bioKepler to support analysis of such data, and presents preliminary results demonstrating these approaches.