
Designing Patterns and Profiles for Faster HMM Search

Authors:
Yanni Sun, Washington University, St. Louis
Jeremy Buhler, Washington University, St. Louis

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 6, Issue 2, pp. 232–243, https://doi.org/10.1109/TCBB.2008.14

Published: 01 April 2009

Abstract

Profile HMMs are powerful tools for modeling conserved motifs in proteins. They are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome. It is highly desirable to speed up HMM search in large databases. We design PROSITE-like patterns and short profiles that are used as filters to rapidly eliminate protein-motif pairs for which a full profile HMM comparison does not yield a significant match. The design of the pattern-based filters is formulated as a multichoice knapsack problem. Profile-based filters with high sensitivity are extracted from a profile HMM based on their theoretical sensitivity and false positive rate. Experiments show that our profile-based filters achieve high sensitivity (near 100 percent) while keeping around 20× speedup with respect to the unfiltered search program. Pattern-based filters typically retain at least 90 percent of the sensitivity of the source HMM with 30-40× speedup. The profile-based filters have sensitivity comparable to the multistage filtering strategy HMMERHEAD [15] and are faster in most of our experiments.
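
The filtering idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows how a PROSITE-like pattern could screen sequences so that an expensive comparison runs on the survivors alone. The function names, the toy pattern, and the `expensive_score` callback (a stand-in for a full profile HMM comparison) are all hypothetical, and only a small subset of PROSITE syntax is handled.

```python
import re


def prosite_to_regex(pattern: str) -> re.Pattern:
    """Translate a small subset of PROSITE syntax into a compiled regex.

    Supported elements: a literal residue (C), x for any residue,
    [ABC] for an allowed set, {ABC} for a forbidden set, and repeat
    counts such as x(3) or x(2,4). Elements are separated by '-'.
    """
    element_re = r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
    parts = []
    for element in pattern.strip(".").split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                         # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"     # forbidden residues
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return re.compile("".join(parts))


def filtered_search(sequences, pattern, expensive_score):
    """Score only the sequences that pass the cheap pattern filter.

    expensive_score is a hypothetical placeholder for the slow step
    (for example, a full profile HMM comparison).
    """
    regex = prosite_to_regex(pattern)
    return {name: expensive_score(seq)
            for name, seq in sequences.items()
            if regex.search(seq)}


if __name__ == "__main__":
    toy_pattern = "C-x(2,4)-C-x(3)-[LIVM]-{P}-H"   # made-up pattern
    toy_db = {"hit": "MKCAACAAALGHTT", "miss": "MKCAACAAALPHTT"}
    # len is used here only as a dummy scoring function.
    print(filtered_search(toy_db, toy_pattern, len))
```

In this sketch a sequence rejected by the regular expression never reaches the scoring function, which is where any speedup would come from; the price is that a true match lacking the pattern is lost, which is the sensitivity trade-off the abstract quantifies.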

References

[1] T.K. Attwood, P. Bradley, D.R. Flower, A. Gaulton, N. Maudling, A. Mitchell, G. Moulton, A. Nordle, K. Paine, P. Taylor, A. Uddin, and C. Zygouri, "PRINTS and Its Automatic Supplement," Nucleic Acids Research, vol. 31, pp. 400-402, 2003.
[2] A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy, "The Pfam Protein Families Database," Nucleic Acids Research, vol. 32, pp. D138-D141, 2004.
[3] C. Bru, E. Courcelle, S. Carrere, Y. Beausse, S. Dalmar, and D. Kahn, "The ProDom Database of Protein Domain Families: More Emphasis on 3D," Nucleic Acids Research, database issue, vol. 33, pp. D212-D215, 2005.
[4] A.K. Chandra, D.S. Hirschberg, and C.K. Wong, "Approximate Algorithms for Some Generalized Knapsack Problems," Theoretical Computer Science, vol. 3, pp. 293-304, 1976.
[5] S.R. Eddy, "Profile Hidden Markov Models," Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.
[6] P.M.K. Gordon and C.W. Sensen, "Osprey: A Comprehensive Tool Employing Novel Methods for the Design of Oligonucleotides for DNA Sequencing and Microarrays," Nucleic Acids Research, vol. 32, no. 17, p. e133, 2004.
[7] S. Henikoff, J.G. Henikoff, and S. Pietrokovski, "Blocks+: A Non-Redundant Database of Protein Alignment Blocks Derived from Multiple Compilations," Bioinformatics, vol. 15, no. 6, pp. 471-479, 1999.
[8] J.G. Henikoff, E.A. Greene, S. Pietrokovski, and S. Henikoff, "Increased Coverage of Protein Families with the Blocks Database Servers," Nucleic Acids Research, vol. 28, pp. 228-230, 2000.
[9] N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. De Castro, P.S. Langendijk-Genevaux, M. Pagni, and C.J.A. Sigrist, "The PROSITE Database," Nucleic Acids Research, vol. 34, pp. 227-230, 2006.
[10] O.H. Ibarra and C.E. Kim, "Fast Approximation Algorithms for the Knapsack and Sum of Subset Problems," J. ACM, vol. 22, pp. 463-468, 1975.
[11] A. Krogh, M. Brown, I.S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov Models in Computational Biology: Applications to Protein Modeling," J. Molecular Biology, vol. 235, pp. 1501-1531, 1994.
[12] I. Letunic, R.R. Copley, B. Pils, S. Pinkert, J. Schultz, and P. Bork, "SMART 5: Domains in the Context of Genomes and Networks," Nucleic Acids Research, vol. 1, no. 34, database issue, pp. D257-D260, 2006.
[13] M. Li, B. Ma, D. Kisman, and J. Tromp, "PatternHunter II: Highly Sensitive and Fast Homology Search," J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004.
[14] M. Madera, C. Vogel, S.K. Kummerfeld, C. Chothia, and J. Gough, "The SUPERFAMILY Database in 2004: Additions and Improvements," Nucleic Acids Research, vol. 32, pp. D235-D239, 2004.
[15] E. Portugaly and M. Ninio, "HMMERHEAD--Accelerating HMM Searches on Large Databases (Poster)," Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB), 2004.
[16] E. Portugaly, A. Harel, N. Linial, and M. Linial, "EVEREST: Automatic Identification and Classification of Protein Domains in All Protein Sequences," BMC Bioinformatics, vol. 7, p. 277, 2006.
[17] L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[18] J. Schultz, F. Milpetz, P. Bork, and C.P. Ponting, "SMART, a Simple Modular Architecture Research Tool: Identification of Signaling Domains," Proc. Nat'l Academy Sciences USA, vol. 95, pp. 5857-5864, 1998.
[19] Y. Sun and J. Buhler, "Designing Multiple Simultaneous Seeds for DNA Similarity Search," Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB '04), pp. 76-84, 2004.
[20] Y. Sun and J. Buhler, "Designing Patterns for Profile HMM Search," Bioinformatics, vol. 23, no. 2, pp. e36-e43, 2006.
[21] J. Xu, D.G. Brown, M. Li, and B. Ma, "Optimizing Multiple Spaced Seeds for Homology Search," Lecture Notes in Computer Science, vol. 3109, pp. 47-58, Springer, 2004.

Published in

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Volume 6, Issue 2
April 2009
191 pages
ISSN: 1545-5963
Publisher: IEEE Computer Society Press, Washington, DC, United States

Author Tags
Biology and genetics
bioinformatics databases
hidden Markov models
sequence similarity search