skip to main content
10.5555/1109557.1109607acmconferencesArticle/Chapter ViewAbstractPublication PagessodaConference Proceedingsconference-collections
Article

Superiority and complexity of the spaced seeds

Published: 22 January 2006 Publication History

Abstract

Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. They are now serving thousands of homology search queries daily. While dozens of papers have been published on optimal spaced seeds since their invention, many fundamental questions still remain unanswered. In this paper, we settle several open questions in this area. Specifically, we prove that when the length of a non-uniformly spaced seed is bounded by an exponential function of the seed weight, the seed outperforms strictly the traditional consecutive seed in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. Then, we study the computation of the hit probability of a spaced seed, solving three more open questions: (iii) hit probability computation in a uniform homologous region is NP-hard and (iv) it admits a PTAS; (v) the asymptotic hit probability is computable in exponential time in seed length, independent of the homologous region length.

References

[1]
S. F. Altschul et al., Basic local alignment search tool, J. Mol. Biol. 215(1990), pp. 403--410.
[2]
S. F. Altschul et al., Gapped Blast and Psi-Blast: a new generation of protein database search programs, Nucleic Acids Res. 25(1997), pp. 3389--3402.
[3]
N. Balakrishnan and M. V. Koutras, Runs and Scans with Applications, John Wiley & Sons, U.S.A., 2002.
[4]
G. Birkhoff, Extension of Jentzsch's theorem, Trans. Amer. Math. Soc. 85(1957), pp. 219--227.
[5]
B. Brejovà, D. Brown, and T. Vinař, Optimal spaced seeds for homologous coding regions. J. Bioinf. and Comp. Biol. 1(2004), pp. 595--610. Early version appeared in CPM 2003.
[6]
J. Buhler, U. Keich, and Y. Sun, Designing seeds for similarity search in genomic DNA. Proc. 7th Annual Int'l Conf. on Comput. Mol. Biol. (RECOMB'03), pp. 67--75, Berlin, Germany.
[7]
K. P. Choi, and L. Zhang, Sensitivity analysis and efficient method for identifying optimal spaced seeds, J. Comput. Sys. Sci., 68(2004), pp. 22--40.
[8]
K. P. Choi, F. Zeng, and L. Zhang, Good Spaced Seeds for Homology Search, Bioinformatics20(2004), pp. 1053--1059.
[9]
W. Feller, An introduction to Probability Theory and its Applications, Vol. I (3rd edition), Wiley, New York.
[10]
F. Nicolas, and E. Rivals, Hardness of Optimal Spaced Seed Designi, Proc. 16th Annual Symposium Combinatorial Pattern Matching (CPM'05), 2005, pp. 144--155.
[11]
M. Garey, and D. Johnson, A guide to the theory of NP-completeness, W. H. Freeman, 1979.
[12]
V. Gotea, V. Veeramachaneni, and W. Makalowski, Mastering seeds for genomic size nucleotide BLAST searches, Nucleic Acids Res. 31(2003), pp. 6935--6941.
[13]
L. J. Guibas and A. M. Odlyzko, String overlaps, pattern matching, and nontransitive games, J. of Combin. Theory (series A) 30(1981), pp. 183--208.
[14]
Int'l Mouse Genome Sequencing Consortium, Initial sequencing and comparative analysis of the mouse genome. Nature409(2002), pp. 520--562.
[15]
P. Jacquet and W. Szpankowski, Analytic Approach to Pattern Matching, In Applied Combinatorics on Words (Editor: M. Lothaire), Cambridge Press, 2005.
[16]
U. Keich, M. Li, B. Ma, and J. Tromp, On Spaced Seeds for Similarity Search, Discrete Appl. Math. 3(2004), pp. 253--263.
[17]
G. Kucherov, L. Noe, and Y. Ponty, Estimating seed sensitivity on homogeneous alignments, In Proc. IEEE 4th Symp. on Bioinformatics and Bioengineering, Taiwan, 2004, pp. 387--394.
[18]
M. Li, B. Ma, and L. Wang, On the Closest String and Substring Problems, Journal of the ACM49(2002), pp. 157--171.
[19]
M. Li, B. Ma, D. Kisman, and J. Tromp, Pattern-HunterII: highly sensitive and fast homology search. J. Bioinformatics and Comput. Biol. 2(2004), pp. 417--440.
[20]
D. J. Lipman and W. R. Pearson, Rapid and sensitive protein similarity searches, Science227(1985), pp. 1435--1441.
[21]
B. Ma, J. Tromp, and M. Li, PatternHunter: faster and more sensitive homology search, Bioinformatics, 18(2002), pp. 440--445.
[22]
S. R. Mahaney, Sparse complete sets for NP: soluton of a conjecture of Berman and Hartmanis, J. Comput. System Sci. 25(1982), pp. 130--143.
[23]
H. Minc, Nonnegative matrices, John Wiley and Sons, New York, 1988.
[24]
National Center for Biotechnology Information. Growth of GenBank, 2002.
[25]
S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48(1970), pp. 443--453.
[26]
P. Nicodéme, B. Salvy, and P. Flajolet, Motif Satistics. Lecture Notes in Computer Sciences, vol. 1643, 1999, pp. 194--211.
[27]
F. P. Preparata, L. Zhang, and K. P. Choi, Quick, practical selection of effective seeds for homology search, To appear in J. Comput. Biology, 2005.
[28]
G. Reinert, S. Schbath, and M. Waterman, Probabilistic and statistical properties of words: An overview, J. Comput. Biol. 7(2000), 1--46.
[29]
S. Schwartz et al. Human-Mouse alignment with BLASTZ, Genome Res.13(2003), pp. 103--107.
[30]
T. F. Smith and M. S. Waterman, Identification of common molecular subsequences, J. Mol. Biol. 147 (1981), pp. 195--197.
[31]
A. D. Solov'ev, A combinatorial identity and its application to the problem concerning the first occurences of a rare event, Theory of Probab. and Appl. 11(1966), pp. 276--282.
[32]
Y. Sun and J. Buhler, Designing multiple simultaneous seeds for DNA similarity search, in Proc. of RECOMB'04, 2004, pp. 76--85.
[33]
J. Xu, D. Brown, M. Li, and B. Ma, Optimizing multiple spaced seeds for homology search, In Proc. of CPM'04, LNCS, vol. 3109, pp. 47--58. Final version to appear in J. Comp. Biol.
[34]
I.-H. Yang et al. Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds, In Proc. IEEE 4th Symp. on Bioinformatics and Bioengineering, Taiwan, 2004, pp. 411--418.
[35]
Z. Zhang, S. Schwartz, L. Wagner, and W. Miller, A greedy algorithm for aligning DNA sequences. J. Comput. Biology. 7(2000), pp. 203--214.

Cited By

View all
  • (2014)Optimizing spaced k-mer neighbors for efficient filtration in protein similarity searchIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.230683111:2(398-406)Online publication date: 1-Mar-2014
  • (2012)Probabilistic Arithmetic Automata and Their ApplicationsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2012.1099:6(1737-1750)Online publication date: 1-Nov-2012
  • (2009)A Novel Heuristic for Local Multiple Alignment of Interspersed DNA RepeatsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2009.96:2(180-189)Online publication date: 1-Apr-2009
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SODA '06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm
January 2006
1261 pages
ISBN:0898716055

Sponsors

Publisher

Society for Industrial and Applied Mathematics

United States

Publication History

Published: 22 January 2006

Check for updates

Qualifiers

  • Article

Acceptance Rates

Overall Acceptance Rate 411 of 1,322 submissions, 31%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)2
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2014)Optimizing spaced k-mer neighbors for efficient filtration in protein similarity searchIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2014.230683111:2(398-406)Online publication date: 1-Mar-2014
  • (2012)Probabilistic Arithmetic Automata and Their ApplicationsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2012.1099:6(1737-1750)Online publication date: 1-Nov-2012
  • (2009)A Novel Heuristic for Local Multiple Alignment of Interspersed DNA RepeatsIEEE/ACM Transactions on Computational Biology and Bioinformatics10.1109/TCBB.2009.96:2(180-189)Online publication date: 1-Apr-2009
  • (2008)Computing Alignment Seed Sensitivity with Probabilistic Arithmetic AutomataProceedings of the 8th international workshop on Algorithms in Bioinformatics10.1007/978-3-540-87361-7_27(318-329)Online publication date: 15-Sep-2008
  • (2007)Fast computation of good multiple spaced seedsProceedings of the 7th international conference on Algorithms in Bioinformatics10.5555/2391870.2391902(346-358)Online publication date: 8-Sep-2007
  • (2007)Protein similarity search with subset seeds on a dedicated reconfigurable hardwareProceedings of the 7th international conference on Parallel processing and applied mathematics10.5555/1786194.1786340(1240-1248)Online publication date: 9-Sep-2007
  • (2007)New algorithms for the spaced seedsProceedings of the 1st annual international conference on Frontiers in algorithmics10.5555/1776166.1776171(50-61)Online publication date: 1-Aug-2007
  • (2006)Procrastination leads to efficient filtration for local multiple alignmentProceedings of the 6th international conference on Algorithms in Bioinformatics10.5555/2091073.2091085(126-137)Online publication date: 11-Sep-2006

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media