DOI: 10.1145/1996023.1996025

Adapting bioinformatics applications for heterogeneous systems: a case study

Published: 08 June 2011

ABSTRACT

The advent of new sequencing technologies has generated extremely large amounts of information. To be applied successfully to such large datasets, bioinformatics tools need to exhibit scalability and, ideally, elasticity in diverse computing environments. We describe the application of Weaver to the PEMer structural variation detection workflow. Because the original workflow has an intractable sequential running time on large datasets, it also has a batch implementation designed for a shared file system. Using scripts provided by the developers of PEMer, along with the Weaver Python module, the Starch archive generator, and the Makeflow workflow engine, we have refactored PEMer for elastic scaling on personal clouds. Our case study describes the various challenges faced when constructing such a workflow, from dealing with failure detection, to managing dependencies, to handling the quirks of the underlying operating systems. The practice of scaling bioinformatics tools is increasingly commonplace. As such, the hands-on application of refactoring techniques to PEMer can serve as a valuable guide for those looking to reconfigure other bioinformatics software. Significantly, our customized Makeflow framework enabled elastic deployment on a wider variety of systems while substantially reducing wall-clock runtimes using hundreds of cores.
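To make the approach concrete, the sketch below illustrates how a scatter/gather stage of such a workflow can be expressed as Makeflow rules generated from a short Python script, in the spirit of the Weaver-to-Makeflow pipeline described above. It is not taken from the paper: the input name, the split_input and pemer_stage commands, and the chunk count are hypothetical placeholders; only the Make-like rule syntax (targets, their sources, and a tab-indented command) reflects actual Makeflow conventions.

    # Minimal sketch with assumed file and command names (not the real PEMer
    # scripts): write a Makeflow file that splits an input read set into
    # chunks, analyzes each chunk independently, and merges the results.
    CHUNKS = 100
    INPUT = "reads.txt"
    rules = []

    for i in range(CHUNKS):
        chunk = f"chunk.{i}.txt"
        result = f"result.{i}.txt"
        # One rule produces a chunk of the input; Makeflow rules follow the
        # Make-like form "targets: sources" with a tab-indented command line.
        rules.append(f"{chunk}: {INPUT}\n\t./split_input {INPUT} {i} {CHUNKS} > {chunk}\n")
        # A second rule analyzes that chunk; independent rules can run in parallel.
        rules.append(f"{result}: {chunk}\n\t./pemer_stage {chunk} > {result}\n")

    # Gather rule: depends on every per-chunk result.
    results = " ".join(f"result.{i}.txt" for i in range(CHUNKS))
    rules.append(f"variants.txt: {results}\n\tcat {results} > variants.txt\n")

    with open("pemer.makeflow", "w") as f:
        f.write("\n".join(rules))

In the setup described in the abstract and in reference 10, Weaver generates a specification of this kind directly from Python, and Starch packages each tool with its dependencies into a self-contained archive, so the Makeflow engine can dispatch the rules to local cores, a batch system, or cloud workers (for example, with something like makeflow -T wq pemer.makeflow when using Work Queue workers).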

References

  1. J. K. Colbourne, M. E. Pfrender, D. Gilbert, W. K. Thomas, et al. The Ecoresponsive Genome of Daphnia pulex. Science, 331(6017):555--561, 2011.
  2. A. E. Darling, L. Carey, and W.-c. Feng. The design, implementation, and evaluation of mpiBLAST. In Proceedings of ClusterWorld 2003, 2003.
  3. E. Ong, E. Lusk, and W. Gropp. Scalable Unix commands for parallel processors: A high-performance implementation. In Proceedings of Recent Advances in Parallel Virtual Machine and Message Passing Interface, pages 410--418. Springer, 2001.
  4. M. L. Kazar. Synchronization and caching issues in the Andrew File System, 1988.
  5. J. Korbel, A. Abyzov, X. Mu, N. Carriero, P. Cayting, Z. Zhang, M. Snyder, and M. Gerstein. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biology, 10(2):R23+, 2009.
  6. C. Moretti, J. Bulosan, D. Thain, and P. J. Flynn. All-Pairs: An abstraction for data-intensive cloud computing. In 22nd IEEE International Symposium on Parallel and Distributed Processing (IPDPS), pages 1--11, 2008.
  7. C. Moretti, M. Olson, S. J. Emrich, and D. Thain. Highly scalable genome assembly on campus grids. In Proceedings of the 2nd Workshop on Many-Task Computing on Grids and Supercomputers (MTAGS), 2009.
  8. A. Pang, J. MacDonald, D. Pinto, J. Wei, et al. Towards a comprehensive structural variation map of an individual human genome. Genome Biology, 11(5):R52+, 2010.
  9. M. C. Schatz, B. Langmead, and S. L. Salzberg. Cloud computing and the DNA data race. Nature Biotechnology, 28(7):691--693, 2010.
  10. A. Thrasher, R. Carmichael, P. Bui, L. Yu, D. Thain, and S. Emrich. Taming complex bioinformatics workflows with Weaver, Makeflow, and Starch. In Proceedings of the 5th Workshop on Workflows in Support of Large-Scale Science, 2010.
  11. O. Trelles. On the parallelisation of bioinformatics applications. Briefings in Bioinformatics, 2(2):181--219, 2001.
  12. L. Yu, C. Moretti, A. Thrasher, S. J. Emrich, K. Judd, and D. Thain. Harnessing parallelism in multicore clusters with the All-Pairs, Wavefront, and Makeflow abstractions. Cluster Computing, 13(3):243--256, 2010.

    Published in

      ECMLS '11: Proceedings of the Second International Workshop on Emerging Computational Methods for the Life Sciences
      June 2011
      44 pages
      ISBN: 9781450307024
      DOI: 10.1145/1996023

      Copyright © 2011 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 8 June 2011


      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate: 7 of 13 submissions, 54%
