skip to main content
article

Designing Patterns and Profiles for Faster HMM Search

Published:01 April 2009Publication History
Skip Abstract Section

Abstract

Profile HMMs are powerful tools for modeling conserved motifs in proteins. They are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome. It is highly desirable to speed up HMM search in large databases. We design PROSITE-like patterns and short profiles that are used as filters to rapidly eliminate protein-motif pairs for which a full profile HMM comparison does not yield a significant match. The design of the pattern-based filters is formulated as a multichoice knapsack problem. Profile-based filters with high sensitivity are extracted from a profile HMM based on their theoretical sensitivity and false positive rate. Experiments show that our profile-based filters achieve high sensitivity (near 100 percent) while keeping around 20\times speedup with respect to the unfiltered search program. Pattern-based filters typically retain at least 90 percent of the sensitivity of the source HMM with 30-40\times speedup. The profile-based filters have sensitivity comparable to the multistage filtering strategy HMMERHEAD [15] and are faster in most of our experiments.

References

  1. T.K. Attwood, P. Bradley, D.R. Flower, A. Gaulton, N. Maudling, A. Mitchell, G. Moulton, A. Nordle, K. Paine, P. Taylor, A. Uddin, and C. Zygouri, "PRINTS and Its Automatic Supplement," Nucleic Acids Research, vol. 31, pp. 400-402, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  2. A. Bateman, L. Coin, R. Durbin, R.D. Finn, V. Hollich, S. Griffiths-Jones, A. Khanna, M. Marshall, S. Moxon, E.L.L. Sonnhammer, D.J. Studholme, C. Yeats, and S.R. Eddy, "The Pfam Protein Families Database," Nucleic Acids Research, vol. 32, pp. D138-D141, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  3. C. Bru, E. Courcelle, S. Carrere, Y. Beausse, S. Dalmar, and D. Kahn, "The ProDom Database of Protein Domain Families: More Emphasis on 3D," Nucleic Acids Research, database issue, vol. 33, pp. D212-D215, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  4. A.K. Chandra, D.S. Hirschberg, and C.K. Wong, "Approximate Algorithms for Some Generalized Knapsack Problems," Theoretical Computer Science, vol. 3, pp. 293-304, 1976.Google ScholarGoogle ScholarCross RefCross Ref
  5. S.R. Eddy, "Profile Hidden Markov Models," Bioinformatics, vol. 14, no. 9, pp. 755-763, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  6. P.M.K. Gordon and C.W. Sensen, "Osprey: A Comprehensive Tool Employing Novel Methods for the Design of Oligonucleotides for DNA Sequencing and Microarrays," Nucleic Acids Research, vol. 32, no. 17, p. e133, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  7. S. Henikoff, J.G. Henikoff, and S. Pietrokovski, "Blocks+: A Non-Redundant Database of Protein Alignment Blocks Derived from Multiple Compilations," Bioinformatics, vol. 15, no. 6, pp. 471-479, 1999.Google ScholarGoogle ScholarCross RefCross Ref
  8. J.G. Henikoff, E.A. Greene, S. Pietrokovski, and S. Henikoff, "Increased Coverage of Protein Families with the Blocks Database Servers," Nucleic Acids Research, vol. 28, pp. 228-230, 2000.Google ScholarGoogle ScholarCross RefCross Ref
  9. N. Hulo, A. Bairoch, V. Bulliard, L. Cerutti, E. De Castro, P.S. Langendijk-Genevaux, M. Pagni, and C.J.A. Sigrist, "The PROSITE Database," Nucleic Acids Research, vol. 34, pp. 227-230, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  10. O.H. Ibarra and C.E. Kim, "Fast Approximation Algorithms for the Knapsack and Sum of Subset Problems," J. ACM, vol. 22, pp. 463-468, 1975. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Krogh, M. Brown, I.S. Mian, K. Sjolander, and D. Haussler, "Hidden Markov Models in Computational Biology: Applications to Protein Modeling," J. Molecular Biology, vol. 235, pp. 1501-1531, 1994.Google ScholarGoogle ScholarCross RefCross Ref
  12. I. Letunic, R.R. Copley, B. Pils, S. Pinkert, J. Schultz, and P. Bork, "SMART 5: Domains in the Context of Genomes and Networks," Nucleic Acids Research, vol. 1, no. 34, database issue, pp. D257- D260, 2006.Google ScholarGoogle Scholar
  13. M. Li, B. Ma, D. Kisman, and J. Tromp, "PatternHunter II: Highly Sensitive and Fast Homology Search," J. Bioinformatics and Computational Biology, vol. 2, no. 3, pp. 417-439, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  14. M. Madera, C. Vogel, S.K. Kummerfeld, C. Chothia, and J. Gough, "The SUPERFAMILY Database in 2004: Additions and Improvements," Nucleic Acids Research, vol. 32, pp. D235-D239, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  15. E. Portugaly and M. Ninio, "HMMERHEAD--Accelerating HMM Searches on Large Databases (Poster)," Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB), 2004.Google ScholarGoogle Scholar
  16. E. Portugaly, A. Harel, N. Linial, and M. Linial, "EVEREST: Automatic Identification and Classification of Protein Domains in All Protein Sequences," BMC Bioinformatics, vol. 7, p. 277, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  17. L.R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989. Google ScholarGoogle ScholarCross RefCross Ref
  18. J. Schultz, F. Milpetz, P. Bork, and C.P. Ponting, "SMART, a Simple Modular Architecture Research Tool: Identification of Signaling Domains," Proc. Nat'l Academy Sciences USA, vol. 95, pp. 5857-5864, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  19. Y. Sun and J. Buhler, "Designing Multiple Simultaneous Seeds for DNA Similarity Search," Proc. Eighth Ann. Int'l Conf. Computational Molecular Biology (RECOMB '04), pp. 76-84, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Y. Sun and J. Buhler, "Designing Patterns for Profile HMM Search," Bioinformatics, vol. 23, no. 2, pp. e36-e43, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Xu, D.G. Brown, M. Li, and B. Ma, "Optimizing Multiple Spaced Seeds for Homology Search," Lecture Notes in Computer Science, vol. 3109, pp. 47-58, Springer 2004.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Designing Patterns and Profiles for Faster HMM Search

                    Recommendations

                    Comments

                    Login options

                    Check if you have access through your login credentials or your institution to get full access on this article.

                    Sign in

                    Full Access

                    PDF Format

                    View or Download as a PDF file.

                    PDF

                    eReader

                    View online with eReader.

                    eReader