article

Superiority of Spaced Seeds for Homology Search

Author:

Louxin ZhangAuthors Info & Claims

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), Volume 4, Issue 3

Pages 496 - 505

https://doi.org/10.1109/tcbb.2007.1013

Published: 01 July 2007 Publication History

Abstract

In homology search, good spaced seeds have higher sensitivity for the same cost (weight). However, elucidating the mechanism that confers power to spaced seeds and characterizing optimal spaced seeds still remain unsolved. This paper investigates these two important open questions by formally analyzing the average number of non-overlapping hits and the hit probability of a spaced seed in the Bernoulli sequence model. We prove that when the length of a non-uniformly spaced seed is bounded above by an exponential function of the seed weight, the seed outperforms strictly the traditional consecutive seed of the same weight in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. This clearly answers the first problem mentioned above in the Bernoulli sequence model. The theoretical study in this paper also gives a new solution to finding long optimal seeds.

References

[1]

S.F. Altschul et al., “Basic Local Alignment Search Tool,” J.Molecular Biology, vol. 215, pp. 403-410, 1990.

[2]

S.F. Altschul et al., “Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, pp. 3389-3402, 1997.

[3]

N. Balakrishnan and M.V. Koutras, Runs and Scans with Applications. John Wiley & Sons, 2002.

[4]

S. Batzoglou, “The Many Faces of Sequence Alignment,” Briefings in Bioinformatics, vol. 6, pp. 6-22, 2005.

[5]

B. Brejovà, D. Brown, and T. Vinař, “Optimal Spaced Seeds for Homologous Coding Regions,” J. Bioinformatics and Computational Biology, vol. 1, pp. 595-610, 2004.

[6]

J. Buhler, “Efficient Large-Scale Sequence Comparison by Local-Sensitive Hashing,” Bioinformatics, vol. 17, pp. 419-428, 2001.

[7]

J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int'l Conf. Research in Computational Molecular Biology (RECOMB '03), pp. 67-75, 2003.

Digital Library

[8]

S. Burkhardt and J. Karkkainen, “Better Filtering with Gapped q-Grams,” Fundamenta Informaticae XXIII, pp. 1001-1018, 2003.

Digital Library

[9]

A. Califano and I. Rigoutsos, “FLASH: Fast Look-Up Algorithm for String Homology,” technical report, IBM T.J. Watson Research Center, 1995.

[10]

K.P. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.

Digital Library

[11]

K.P. Choi and L. Zhang, “Analysis of Spaced Seed Technique in Sequence Alignment,” Cosmos, vol. 1, pp. 57-73, 2005.

[12]

K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, pp. 1053-1059, 2004.

Digital Library

[13]

W. Feller, An Introduction to Probability Theory and Its Applications, third ed. Wiley, 1968.

[14]

V. Gotea, V. Veeramachaneni, and W. Makalowski, “Mastering Seeds for Genomic Size Nucleotide BLAST Searches,” Nucleic Acids Research, vol. 31, pp. 6935-6941, 2003.

[15]

L.J. Guibas and A.M. Odlyzko, “String Overlaps, Pattern Matching, and Nontransitive Games,” J. Combinational Theory, vol. 30, pp. 183-208, 1981.

[16]

G.H. Hardy, J.E. Littlewood, and G. Pólya, Inequalities, second ed. Cambridge Univ. Press, 1952.

[17]

P. Jacquet and W. Szpankowski, “Analytic Approach to Pattern Matching,” Applied Combinatorics on Words, M. Lothaire, ed., Cambridge Press, 2005.

[18]

U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 3, pp. 253-263, 2004.

Digital Library

[19]

W.J. Kent, “BLAT—The BLAST-Like Alignment Tool,” Genome Research, vol. 12, pp. 656-664, 2002.

[20]

W.J. Kent and A.M. Zahler, “Conservation, Regulation, Synteny, and Introns in a Large-Scale C. briggsae-C. elegans Genomic Alignment,” Genome Research, vol. 10, pp. 1115-1125, 2000.

[21]

Y. Kong, “Methods to Find Optimal Multiple Spaced Seeds for Homology Search Using Generalized Correlations,” manuscript, 2005.

[22]

L. Noé and G. Kucherov, “YASS: Enhancing the Sensitivity of DNA Similarity Search,” Nucleic Acids Research, vol. 33, pp. W540-W543, 2005.

[23]

G. Kucherov, L. Noé, and M. Roytberg, “A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds,” Proc. Fifth Workshop Algorithms in Bioinformatics (WABI '05), 2005.

Digital Library

[24]

S.Y. Li, “A Martingale Approach to the Study of Occurrence of Sequence Patterns in Repeated Experiments,” Annals of Applied Probability, vol. 8, pp. 1171-1176, 1980.

[25]

M. Li and B. Ma, “On the Complexity of Computing the Sensitivity of Spaced Seeds,” manuscript, 2005.

[26]

M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunterII: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, pp. 417-440, 2004.

[27]

B. Ma, J. Tromp, and M. Li, “PatternHunter-Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, pp. 440-445, 2002.

[28]

P. Nicodéme, B. Salvy, and P. Flajolet, “Motif Statistics,” Lecture Notes in Computer Sciences, vol. 1643, pp. 194-211, 1999.

Digital Library

[29]

F.P. Preparata, L. Zhang, and K.P. Choi, “Quick, Practical Selection of Effective Seeds for Homology Search,” J. Computational Biology, vol. 12, pp. 1137-1152, 2005.

[30]

G. Reinert, S. Schbath, and M. Waterman, “Probabilistic and Statistical Properties of Words: An Overview,” J. Computational Biology, vol. 7, pp. 1-46, 2000.

[31]

S.J. Schwager, “Run Probabilities in Sequences of Markov-Dependent Trials,” J. Am. Statistics Assoc., vol. 78, pp. 168-175, 1983.

[32]

S. Schwartz et al., “Human-Mouse Alignment with BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.

[33]

A.D. Solov'ev, “A Combinatorial Identity and Its Application to the Problem Concerning the First Occurences of a Rare Event,” Theory of Probability and Application, vol. 11, pp. 276-282, 1966.

[34]

Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. ACM-SIAM Int'l Conf. Research in Computational Molecular Biology (RECOMB '04), pp. 76-85, 2004.

Digital Library

[35]

I.-H. Yang, S.-H. Wang, Y.-H. Chen, P.-H. Huang, L. Ye, X. Huang, and K.-M. Chao, “Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds,” Proc. Fourth IEEE Symp. Bioinformatics and Bioeng., pp. 411-418, 2004.

Digital Library

Cited By

Hou MZhang LHarris RRanka SKahveci TSingh M(2012)Alignment seeding strategies using contiguous pyrimidine purine matchesProceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine10.1145/2382936.2382985(384-391)Online publication date: 7-Oct-2012
https://dl.acm.org/doi/10.1145/2382936.2382985
Brodzik A(2010)Rapid sequence homology assessment by subsampling the genome space using difference setsIEEE Transactions on Information Theory10.1109/TIT.2009.203703656:2(756-770)Online publication date: 1-Feb-2010
https://dl.acm.org/doi/10.1109/TIT.2009.2037036

Index Terms

Superiority of Spaced Seeds for Homology Search

Recommendations

Efficient filtration for similarity search with spaced k-mer neighbors
BIBM '12: Proceedings of the 2012 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

In DNA and protein sequence similarity search, seeding (or filtration) has been widely used to trade search sensitivity with search speed. In this paper, a new seeding method, called spaced fc-mer neighbors, is introduced to provide a more efficient ...
On the complexity of the spaced seeds

Optimal spaced seeds were introduced by the theoretical computer science community to bioinformatics to effectively increase homology search sensitivity. These seeds are serving many homology queries daily. However the computational complexity of ...
Optimizing spaced k-mer neighbors for efficient filtration in protein similarity search

Large-scale comparison or similarity search of genomic DNA and protein sequence is of fundamental importance in modern molecular biology. To perform DNA and protein sequence similarity search efficiently, seeding (or filtration) method has been widely ...

Comments

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics

IEEE/ACM Transactions on Computational Biology and Bioinformatics Volume 4, Issue 3

July 2007

192 pages

ISSN:1545-5963

Issue’s Table of Contents

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 July 2007

Published in TCBB Volume 4, Issue 3

Author Tags

Qualifiers

Article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

2
Total Citations
View Citations
159
Total Downloads

Downloads (Last 12 months)3
Downloads (Last 6 weeks)0

Reflects downloads up to 13 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Hou MZhang LHarris RRanka SKahveci TSingh M(2012)Alignment seeding strategies using contiguous pyrimidine purine matchesProceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine10.1145/2382936.2382985(384-391)Online publication date: 7-Oct-2012
https://dl.acm.org/doi/10.1145/2382936.2382985
Brodzik A(2010)Rapid sequence homology assessment by subsampling the genome space using difference setsIEEE Transactions on Information Theory10.1109/TIT.2009.203703656:2(756-770)Online publication date: 1-Feb-2010
https://dl.acm.org/doi/10.1109/TIT.2009.2037036

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Issue’s Table of Contents