Article

Superiority and complexity of the spaced seeds

Authors:

Louxin ZhangAuthors Info & Claims

SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm

Pages 444 - 453

Published: 22 January 2006 Publication History

Abstract

Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. They are now serving thousands of homology search queries daily. While dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions still remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a non-uniformly spaced seed is bounded by an exponential function of the seed weight, the seed outperforms strictly the traditional consecutive seed in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. Then, we study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in exponential time in seed length, independent of the homologous region length.

References

[1]

S. F. Altschul et al., Basic local alignment search tool, J. Mol. Biol. 215(1990), pp. 403--410.

[2]

S. F. Altschul et al., Gapped Blast and Psi-Blast: a new generation of protein database search programs, Nucleic Acids Res. 25(1997), pp. 3389--3402.

[3]

N. Balakrishnan and M. V. Koutras, Runs and Scans with Applications, John Wiley & Sons, U.S.A., 2002.

[4]

G. Birkhoff, Extension of Jentzsch's theorem, Trans. Amer. Math. Soc. 85(1957), pp. 219--227.

[5]

B. Brejovà, D. Brown, and T. Vinař, Optimal spaced seeds for homologous coding regions. J. Bioinf. and Comp. Biol. 1(2004), pp. 595--610. Early version appeared in CPM 2003.

[6]

J. Buhler, U. Keich, and Y. Sun, Designing seeds for similarity search in genomic DNA. Proc. 7th Annual Int'l Conf. on Comput. Mol. Biol. (RECOMB'03), pp. 67--75, Berlin, Germany.

Digital Library

[7]

K. P. Choi, and L. Zhang, Sensitivity analysis and efficient method for identifying optimal spaced seeds, J. Comput. Sys. Sci., 68(2004), pp. 22--40.

Digital Library

[8]

K. P. Choi, F. Zeng, and L. Zhang, Good Spaced Seeds for Homology Search, Bioinformatics20(2004), pp. 1053--1059.

Digital Library

[9]

W. Feller, An introduction to Probability Theory and its Applications, Vol. I (3rd edition), Wiley, New York.

[10]

F. Nicolas, and E. Rivals, Hardness of Optimal Spaced Seed Designi, Proc. 16th Annual Symposium Combinatorial Pattern Matching (CPM'05), 2005, pp. 144--155.

Digital Library

[11]

M. Garey, and D. Johnson, A guide to the theory of NP-completeness, W. H. Freeman, 1979.

Digital Library

[12]

V. Gotea, V. Veeramachaneni, and W. Makalowski, Mastering seeds for genomic size nucleotide BLAST searches, Nucleic Acids Res. 31(2003), pp. 6935--6941.

[13]

L. J. Guibas and A. M. Odlyzko, String overlaps, pattern matching, and nontransitive games, J. of Combin. Theory (series A) 30(1981), pp. 183--208.

[14]

Int'l Mouse Genome Sequencing Consortium, Initial sequencing and comparative analysis of the mouse genome. Nature409(2002), pp. 520--562.

[15]

P. Jacquet and W. Szpankowski, Analytic Approach to Pattern Matching, In Applied Combinatorics on Words (Editor: M. Lothaire), Cambridge Press, 2005.

[16]

U. Keich, M. Li, B. Ma, and J. Tromp, On Spaced Seeds for Similarity Search, Discrete Appl. Math. 3(2004), pp. 253--263.

Digital Library

[17]

G. Kucherov, L. Noe, and Y. Ponty, Estimating seed sensitivity on homogeneous alignments, In Proc. IEEE 4th Symp. on Bioinformatics and Bioengineering, Taiwan, 2004, pp. 387--394.

Digital Library

[18]

M. Li, B. Ma, and L. Wang, On the Closest String and Substring Problems, Journal of the ACM49(2002), pp. 157--171.

Digital Library

[19]

M. Li, B. Ma, D. Kisman, and J. Tromp, Pattern-HunterII: highly sensitive and fast homology search. J. Bioinformatics and Comput. Biol. 2(2004), pp. 417--440.

[20]

D. J. Lipman and W. R. Pearson, Rapid and sensitive protein similarity searches, Science227(1985), pp. 1435--1441.

[21]

B. Ma, J. Tromp, and M. Li, PatternHunter: faster and more sensitive homology search, Bioinformatics, 18(2002), pp. 440--445.

[22]

S. R. Mahaney, Sparse complete sets for NP: soluton of a conjecture of Berman and Hartmanis, J. Comput. System Sci. 25(1982), pp. 130--143.

[23]

H. Minc, Nonnegative matrices, John Wiley and Sons, New York, 1988.

[24]

National Center for Biotechnology Information. Growth of GenBank, 2002.

[25]

S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48(1970), pp. 443--453.

[26]

P. Nicodéme, B. Salvy, and P. Flajolet, Motif Satistics. Lecture Notes in Computer Sciences, vol. 1643, 1999, pp. 194--211.

Digital Library

[27]

F. P. Preparata, L. Zhang, and K. P. Choi, Quick, practical selection of effective seeds for homology search, To appear in J. Comput. Biology, 2005.

[28]

G. Reinert, S. Schbath, and M. Waterman, Probabilistic and statistical properties of words: An overview, J. Comput. Biol. 7(2000), 1--46.

[29]

S. Schwartz et al. Human-Mouse alignment with BLASTZ, Genome Res.13(2003), pp. 103--107.

[30]

T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, J. Mol. Biol. 147 (1981), pp. 195--197.

[31]

A. D. Solov'ev, A combinatorial identity and its application to the problem concerning the first occurences of a rare event, Theory of Probab. and Appl. 11(1966), pp. 276--282.

[32]

Y. Sun and J. Buhler, Designing multiple simultaneous seeds for DNA similarity search, in Proc. of RECOMB'04, 2004, pp. 76--85.

Digital Library

[33]

J. Xu, D. Brown, M. Li, and B. Ma, Optimizing multiple spaced seeds for homology search, In Proc. of CPM'04, LNCS, vol. 3109, pp. 47--58. Final version to appear in J. Comp. Biol.

[34]

I.-H. Yang et al. Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds, In Proc. IEEE 4th Symp. on Bioinformatics and Bioengineering, Taiwan, 2004, pp. 411--418.

Digital Library

[35]

Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, A greedy algorithm for aligning DNA sequences. J. Comput. Biology. 7(2000), pp. 203--214.

Cited By

Li WMa BZhang K(2014)Optimizing spaced k-mer neighbors for efficient filtration in protein similarity searchIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.230683111:2(398-406)Online publication date: 1-Mar-2014
https://dl.acm.org/doi/10.1109/TCBB.2014.2306831
Marschall THerms IKaltenbach HRahmann S(2012)Probabilistic Arithmetic Automata and Their ApplicationsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2012.1099:6(1737-1750)Online publication date: 1-Nov-2012
https://dl.acm.org/doi/10.1109/TCBB.2012.109
Treangen TDarling AAchaz GRagan MMesseguer XRocha E(2009)A Novel Heuristic for Local Multiple Alignment of Interspersed DNA RepeatsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2009.96:2(180-189)Online publication date: 1-Apr-2009
https://dl.acm.org/doi/10.1109/TCBB.2009.9
Show More Cited By

Index Terms

Superiority and complexity of the spaced seeds
1. Theory of computation
  1. Design and analysis of algorithms
  2. Models of computation
    1. Probabilistic computation

Recommendations

On the complexity of the spaced seeds

Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. These seeds are serving many homology queries daily. However the computational complexity of ...
Superiority of Spaced Seeds for Homology Search

In homology search, good spaced seeds have higher sensitivity for the same cost (weight). However, elucidating the mechanism that confers power to spaced seeds and characterizing optimal spaced seeds still remain unsolved. This paper investigates these ...
Good spaced seeds for homology search

Motivation: Filtration is an important technique used to speed up local alignment as exemplified in the BLAST programs. Recently, Ma et al. discovered that better filtering can be achieved by spacing out the matching positions according to a certain ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm

January 2006

1261 pages

ISBN:0898716055

Sponsors

SIAM Activity Group on Discrete Mathematics
SIGACT: ACM Special Interest Group on Algorithms and Computation Theory

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 22 January 2006

Check for updates

Qualifiers

Article

Acceptance Rates

Overall Acceptance Rate 411 of 1,322 submissions, 31%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

8
Total Citations
View Citations
271
Total Downloads

Downloads (Last 12 months)2
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Li WMa BZhang K(2014)Optimizing spaced k-mer neighbors for efficient filtration in protein similarity searchIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.230683111:2(398-406)Online publication date: 1-Mar-2014
https://dl.acm.org/doi/10.1109/TCBB.2014.2306831
Marschall THerms IKaltenbach HRahmann S(2012)Probabilistic Arithmetic Automata and Their ApplicationsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2012.1099:6(1737-1750)Online publication date: 1-Nov-2012
https://dl.acm.org/doi/10.1109/TCBB.2012.109
Treangen TDarling AAchaz GRagan MMesseguer XRocha E(2009)A Novel Heuristic for Local Multiple Alignment of Interspersed DNA RepeatsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2009.96:2(180-189)Online publication date: 1-Apr-2009
https://dl.acm.org/doi/10.1109/TCBB.2009.9
Herms IRahmann S(2008)Computing Alignment Seed Sensitivity with Probabilistic Arithmetic AutomataProceedings of the 8th international workshop on Algorithms in Bioinformatics10.1007/978-3-540-87361-7_27(318-329)Online publication date: 15-Sep-2008
https://dl.acm.org/doi/10.1007/978-3-540-87361-7_27
Ilie LIlie S(2007)Fast computation of good multiple spaced seedsProceedings of the 7th international conference on Algorithms in Bioinformatics10.5555/2391870.2391902(346-358)Online publication date: 8-Sep-2007
https://dl.acm.org/doi/10.5555/2391870.2391902
Peterlongo PNoé LLavenier DGeorges GJacques JKucherov GGiraud M(2007)Protein similarity search with subset seeds on a dedicated reconfigurable hardwareProceedings of the 7th international conference on Parallel processing and applied mathematics10.5555/1786194.1786340(1240-1248)Online publication date: 9-Sep-2007
https://dl.acm.org/doi/10.5555/1786194.1786340
Gao XLi SLu Y(2007)New algorithms for the spaced seedsProceedings of the 1st annual international conference on Frontiers in algorithmics10.5555/1776166.1776171(50-61)Online publication date: 1-Aug-2007
https://dl.acm.org/doi/10.5555/1776166.1776171
Darling ATreangen TZhang LKuiken CMesseguer XPerna N(2006)Procrastination leads to efficient filtration for local multiple alignmentProceedings of the 6th international conference on Algorithms in Bioinformatics10.5555/2091073.2091085(126-137)Online publication date: 11-Sep-2006
https://dl.acm.org/doi/10.5555/2091073.2091085

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten