ABSTRACT
New sequencing technologies generate extremely large amounts of data. To be applied successfully to such datasets, bioinformatics tools must scale and, ideally, behave elastically across diverse computing environments. We describe the application of Weaver to the PEMer structural variation detection workflow. Because the original workflow's sequential running time is intractable on large datasets, PEMer also has a batch implementation designed for a shared file system. Using scripts provided by the PEMer developers, together with the Weaver Python module, the Starch archive generator, and the Makeflow workflow engine, we refactored PEMer for elastic scaling on personal clouds. Our case study describes the challenges faced when constructing such a workflow, from detecting failures, to managing dependencies, to handling the quirks of the underlying operating systems. Scaling bioinformatics tools is increasingly commonplace, so this hands-on application of refactoring techniques to PEMer can serve as a guide for those looking to reconfigure other bioinformatics software. Significantly, our customized Makeflow framework enabled elastic deployment on a wider variety of systems while substantially reducing wall-clock run times using hundreds of cores.
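As a rough illustration of the kind of workflow description a make-style engine such as Makeflow consumes, the sketch below generates rules in plain Python: one independent task per input chunk, followed by a merge step that depends on all of them. It does not use the Weaver API, and the script names (`process_chunk.sh`, `merge_results.sh`), the chunk names, and the output file names are hypothetical placeholders, not part of PEMer.

```python
def makeflow_rule(targets, sources, command):
    """Format one make-style rule: 'targets: sources', then a tab-indented command."""
    return "{}: {}\n\t{}\n".format(" ".join(targets), " ".join(sources), command)


def build_workflow(chunks, worker="./process_chunk.sh", merger="./merge_results.sh"):
    """Emit one independent rule per input chunk, then one rule merging all outputs.

    The worker and merger scripts are hypothetical stand-ins for real pipeline steps.
    """
    rules = []
    outputs = []
    for chunk in chunks:
        out = chunk + ".out"
        outputs.append(out)
        # Each chunk rule lists the script and its input as explicit dependencies,
        # so the engine knows which files a remote task needs.
        rules.append(makeflow_rule([out], [worker, chunk],
                                   "{} {} > {}".format(worker, chunk, out)))
    # The merge rule depends on every per-chunk output, so it runs last.
    rules.append(makeflow_rule(["merged.out"], [merger] + outputs,
                               "{} {} > merged.out".format(merger, " ".join(outputs))))
    return "\n".join(rules)


if __name__ == "__main__":
    print(build_workflow(["reads.0", "reads.1"]))
```

Because each per-chunk rule has no dependency on the others, an engine is free to run them concurrently on however many cores are available; only the final merge has to wait.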
Index Terms
- Adapting bioinformatics applications for heterogeneous systems: a case study