
Challenges and approaches for distributed workflow-driven analysis of large-scale biological data: vision paper

Authors:
Ilkay Altintas, University of California, San Diego, La Jolla, CA
Jianwu Wang, University of California, San Diego, La Jolla, CA
Daniel Crawl, University of California, San Diego, La Jolla, CA
Weizhong Li, University of California, San Diego, La Jolla, CA

EDBT-ICDT '12: Proceedings of the 2012 Joint EDBT/ICDT Workshops, March 2012, Pages 73–78
https://doi.org/10.1145/2320765.2320791

Published: 30 March 2012

ABSTRACT

Next-generation DNA sequencing machines are generating a very large amount of sequence data with applications in many scientific challenges and placing unprecedented demands on traditional single-processor bioinformatics algorithms. Middleware and technologies for scientific workflows and data-intensive computing promise new capabilities to enable rapid analysis of next-generation sequence data. Based on this motivation and our previous experiences in bioinformatics and distributed scientific workflows, we are creating a Kepler Scientific Workflow System module, called "bioKepler", that facilitates the development of Kepler workflows for integrated execution of bioinformatics applications in distributed environments. This vision paper discusses the challenges related to next-generation sequencing data, explains the approaches taken in bioKepler to help with analysis of such data, and presents preliminary results demonstrating these approaches.


Index Terms

1. Information systems
  1. Information systems applications


Published in
EDBT-ICDT '12: Proceedings of the 2012 Joint EDBT/ICDT Workshops
March 2012
265 pages
ISBN: 9781450311434
DOI: 10.1145/2320765
Editors:
Divesh Srivastava, AT&T Labs-Research
Ismail Ari, Ozyegin University, Turkey
Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
application
bioinformatics
data-parallel patterns
next generation sequence analysis
scientific workflows
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 7 of 10 submissions, 70%