research-article

Adapting bioinformatics applications for heterogeneous systems: a case study

Authors:
Irena Lanc, Peter Bui, Douglas Thain, Scott Emrich

University of Notre Dame, Notre Dame, IN, USA

ECMLS '11: Proceedings of the second international workshop on Emerging computational methods for the life sciences, June 2011, Pages 7–14, https://doi.org/10.1145/1996023.1996025

Published: 08 June 2011

ABSTRACT

The advent of new sequencing technologies has generated extremely large amounts of information. To be applied successfully to such large datasets, bioinformatics tools need to exhibit scalability and, ideally, elasticity in diverse computing environments. We describe the application of Weaver to the PEMer structural variation detection workflow. Because the original workflow has an intractable sequential running time on large datasets, it also has a batch implementation designed for a shared file system. Using scripts provided by the developers of PEMer, along with the Weaver Python module, the Starch archive generator, and the Makeflow workflow engine, we have refactored PEMer for elastic scaling on personal clouds. Our case study describes the various challenges faced when constructing such a workflow, from detecting failures, to managing dependencies, to handling the quirks of the underlying operating systems. The practice of scaling bioinformatics tools is increasingly commonplace. As such, the hands-on application of refactoring techniques to PEMer can serve as a valuable guide for those looking to reconfigure other bioinformatics software. Significantly, our customized Makeflow framework enabled elastic deployment on a wider variety of systems while substantially reducing wall-clock runtimes using hundreds of cores.

Index Terms

Information systems

Published in
ECMLS '11: Proceedings of the second international workshop on Emerging computational methods for the life sciences
June 2011
44 pages
ISBN: 9781450307024
DOI: 10.1145/1996023
Program Chairs: Ian Foster (University of Chicago, USA), Judy Qiu (Indiana University, USA), Ronald Taylor (Pacific Northwest National Laboratory, USA)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bioinformatics
distributed systems
Acceptance Rates
Overall Acceptance Rate: 7 of 13 submissions, 54%