ABSTRACT
We address the problem of performing faster read alignment on GPU devices. DNA sequence processing is extremely computationally intensive, as constant progress in sequencing technology leads to ever-increasing amounts of sequence data [6]. One possible solution is to exploit the massive parallelism of modern GPU devices [5]. However, the performance characteristics and programming models of GPUs differ from those of traditional architectures and require new approaches.
Most importantly, host memory and I/O systems are not directly accessible from a GPU device, and GPU memory is usually an order of magnitude smaller than host memory. Given the size of read alignment data, this memory limit becomes a real problem: when the reference sequence index does not fit into memory, it has to be split into chunks that are processed individually. In most cases the complexity of the algorithm does not depend on the index size, so such index splitting increases computation time tremendously. Our analysis of existing solutions for read alignment on GPUs showed that the memory limit is the chief performance issue. One attempt to reduce memory consumption replaced the commonly used suffix tree, which allows for better theoretical performance of the algorithm [4], with a suffix array, which is less efficient in terms of pure computational complexity but more compact. By doing this, the authors of MummerGPU++ achieved several times better performance [3].
We suggest using the Burrows-Wheeler Transform [1] for both the index and the corresponding search algorithm to achieve a much smaller memory footprint. This transform is used mainly in compression algorithms such as bzip2, as it replaces recurring patterns in a string with contiguous runs of a single symbol, but it can also be used for pattern matching [2]. At the same time, we continue to use the more traditional suffix array on the host side to benefit from the computational characteristics of both GPU and CPU. We reduced the index size 12-fold and, by this alone, achieved a 3-4x performance improvement over the suffix-array-based MummerGPU++.
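The idea of pattern matching over a BWT-based index [2] can be illustrated with a minimal sketch of backward search. All names below are illustrative, and the naive rotation sort and plain occurrence tables stand in for the compressed, GPU-oriented data layout of the actual implementation:

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations.

    O(n^2 log n) construction; fine for a demo, not for genome-scale data.
    """
    text += "$"  # sentinel, lexicographically smallest symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(r[-1] for r in rotations)

def build_tables(bwt_str):
    """C[c]: number of symbols smaller than c; occ[c][i]: count of c in bwt_str[:i]."""
    alphabet = sorted(set(bwt_str))
    occ = {c: [0] for c in alphabet}
    for ch in bwt_str:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (c == ch))
    counts = {
        c: sum(bwt_str.count(d) for d in alphabet if d < c) for c in alphabet
    }
    return counts, occ

def count_matches(bwt_str, pattern):
    """Count occurrences of pattern in the original text via backward search."""
    C, occ = build_tables(bwt_str)
    lo, hi = 0, len(bwt_str)  # current range of matching suffix-array rows
    for ch in reversed(pattern):  # extend the match one symbol at a time
        if ch not in C:
            return 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_matches(bwt("banana"), "ana")` returns 2. Note that the search touches only the BWT string and two small tables, which is what makes the index compact compared to a suffix tree or suffix array.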
Since even with this compressed index the workload can exceed available device memory, we developed a performance model to analyze how overall execution time is affected by the proportions and order in which memory is allocated for chunks of the index and the query set. This model allowed us to find the best balance of memory allocation and double performance compared to the naive approach of allocating equal shares of memory to index and queries. We then apply the model to show that using multiple GPUs not only speeds up the application, but also overcomes some single-GPU performance issues, yielding super-linear scaling at least for the number of GPUs typically available on one host.
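The memory-split tradeoff can be sketched with a toy cost model (our own simplification for illustration, not the poster's actual model): every query chunk must be matched against every index chunk, so the transfer volume and kernel-launch count grow with the product of the chunk counts, while per-query compute grows with the number of index chunks. The bandwidth and compute-rate parameters below are hypothetical.

```python
import math

def total_time(index_size, query_size, mem, frac, bw=10.0, rate=1.0):
    """Toy model of chunked GPU alignment time.

    mem is split as frac for the index chunk and (1 - frac) for the
    query chunk; bw is transfer bandwidth, rate is alignment throughput.
    """
    idx_chunk = mem * frac
    qry_chunk = mem * (1 - frac)
    n_idx = math.ceil(index_size / idx_chunk)
    n_qry = math.ceil(query_size / qry_chunk)
    # The index is re-uploaded once per query chunk; queries are uploaded once.
    transfer = (n_qry * index_size + query_size) / bw
    # Alignment cost per query is independent of index size, so splitting
    # the index into n_idx chunks multiplies compute by n_idx.
    compute = n_idx * query_size / rate
    return transfer + compute

# Sweeping the split fraction finds an allocation that beats equal shares.
best = min(total_time(1200, 1200, 100, f / 10) for f in range(1, 10))
```

Under these toy parameters the optimum gives the index a larger share of memory than the queries, mirroring the observation that a balanced-by-measurement split can substantially outperform the naive equal split.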
- M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, 1994.
- P. Ferragina and G. Manzini. Indexing compressed text. Journal of the ACM, 53(4):552--581, 2005.
- A. Gharaibeh and M. Ripeanu. Size matters: Space/time tradeoffs to improve GPGPU applications performance. In SC '10: Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE Computer Society, 2010.
- D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
- J. D. Owens, D. Luebke, N. Govindaraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell. A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1):80--113, 2007.
- M. Pop. Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics, 10:354, 2009.
Index Terms
- Poster: fast GPU read alignment with burrows wheeler transform based index