skip to main content
article

Superiority of Spaced Seeds for Homology Search

Published: 01 July 2007 Publication History

Abstract

In homology search, good spaced seeds have higher sensitivity for the same cost (weight). However, elucidating the mechanism that confers power to spaced seeds and characterizing optimal spaced seeds still remain unsolved. This paper investigates these two important open questions by formally analyzing the average number of non-overlapping hits and the hit probability of a spaced seed in the Bernoulli sequence model. We prove that when the length of a non-uniformly spaced seed is bounded above by an exponential function of the seed weight, the seed outperforms strictly the traditional consecutive seed of the same weight in both (i) the average number of non-overlapping hits and (ii) the asymptotic hit probability. This clearly answers the first problem mentioned above in the Bernoulli sequence model. The theoretical study in this paper also gives a new solution to finding long optimal seeds.

References

[1]
S.F. Altschul et al., “Basic Local Alignment Search Tool,” J.Molecular Biology, vol. 215, pp. 403-410, 1990.
[2]
S.F. Altschul et al., “Gapped Blast and Psi-Blast: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, pp. 3389-3402, 1997.
[3]
N. Balakrishnan and M.V. Koutras, Runs and Scans with Applications. John Wiley & Sons, 2002.
[4]
S. Batzoglou, “The Many Faces of Sequence Alignment,” Briefings in Bioinformatics, vol. 6, pp. 6-22, 2005.
[5]
B. Brejovà, D. Brown, and T. Vinař, “Optimal Spaced Seeds for Homologous Coding Regions,” J. Bioinformatics and Computational Biology, vol. 1, pp. 595-610, 2004.
[6]
J. Buhler, “Efficient Large-Scale Sequence Comparison by Local-Sensitive Hashing,” Bioinformatics, vol. 17, pp. 419-428, 2001.
[7]
J. Buhler, U. Keich, and Y. Sun, “Designing Seeds for Similarity Search in Genomic DNA,” Proc. Seventh Ann. Int'l Conf. Research in Computational Molecular Biology (RECOMB '03), pp. 67-75, 2003.
[8]
S. Burkhardt and J. Karkkainen, “Better Filtering with Gapped q-Grams,” Fundamenta Informaticae XXIII, pp. 1001-1018, 2003.
[9]
A. Califano and I. Rigoutsos, “FLASH: Fast Look-Up Algorithm for String Homology,” technical report, IBM T.J. Watson Research Center, 1995.
[10]
K.P. Choi and L. Zhang, “Sensitivity Analysis and Efficient Method for Identifying Optimal Spaced Seeds,” J. Computer and System Sciences, vol. 68, pp. 22-40, 2004.
[11]
K.P. Choi and L. Zhang, “Analysis of Spaced Seed Technique in Sequence Alignment,” Cosmos, vol. 1, pp. 57-73, 2005.
[12]
K.P. Choi, F. Zeng, and L. Zhang, “Good Spaced Seeds for Homology Search,” Bioinformatics, vol. 20, pp. 1053-1059, 2004.
[13]
W. Feller, An Introduction to Probability Theory and Its Applications, third ed. Wiley, 1968.
[14]
V. Gotea, V. Veeramachaneni, and W. Makalowski, “Mastering Seeds for Genomic Size Nucleotide BLAST Searches,” Nucleic Acids Research, vol. 31, pp. 6935-6941, 2003.
[15]
L.J. Guibas and A.M. Odlyzko, “String Overlaps, Pattern Matching, and Nontransitive Games,” J. Combinational Theory, vol. 30, pp. 183-208, 1981.
[16]
G.H. Hardy, J.E. Littlewood, and G. Pólya, Inequalities, second ed. Cambridge Univ. Press, 1952.
[17]
P. Jacquet and W. Szpankowski, “Analytic Approach to Pattern Matching,” Applied Combinatorics on Words, M. Lothaire, ed., Cambridge Press, 2005.
[18]
U. Keich, M. Li, B. Ma, and J. Tromp, “On Spaced Seeds for Similarity Search,” Discrete Applied Math., vol. 3, pp. 253-263, 2004.
[19]
W.J. Kent, “BLAT—The BLAST-Like Alignment Tool,” Genome Research, vol. 12, pp. 656-664, 2002.
[20]
W.J. Kent and A.M. Zahler, “Conservation, Regulation, Synteny, and Introns in a Large-Scale C. briggsae-C. elegans Genomic Alignment,” Genome Research, vol. 10, pp. 1115-1125, 2000.
[21]
Y. Kong, “Methods to Find Optimal Multiple Spaced Seeds for Homology Search Using Generalized Correlations,” manuscript, 2005.
[22]
L. Noé and G. Kucherov, “YASS: Enhancing the Sensitivity of DNA Similarity Search,” Nucleic Acids Research, vol. 33, pp. W540-W543, 2005.
[23]
G. Kucherov, L. Noé, and M. Roytberg, “A Unifying Framework for Seed Sensitivity and Its Application to Subset Seeds,” Proc. Fifth Workshop Algorithms in Bioinformatics (WABI '05), 2005.
[24]
S.Y. Li, “A Martingale Approach to the Study of Occurrence of Sequence Patterns in Repeated Experiments,” Annals of Applied Probability, vol. 8, pp. 1171-1176, 1980.
[25]
M. Li and B. Ma, “On the Complexity of Computing the Sensitivity of Spaced Seeds,” manuscript, 2005.
[26]
M. Li, B. Ma, D. Kisman, and J. Tromp, “PatternHunterII: Highly Sensitive and Fast Homology Search,” J. Bioinformatics and Computational Biology, vol. 2, pp. 417-440, 2004.
[27]
B. Ma, J. Tromp, and M. Li, “PatternHunter-Faster and More Sensitive Homology Search,” Bioinformatics, vol. 18, pp. 440-445, 2002.
[28]
P. Nicodéme, B. Salvy, and P. Flajolet, “Motif Statistics,” Lecture Notes in Computer Sciences, vol. 1643, pp. 194-211, 1999.
[29]
F.P. Preparata, L. Zhang, and K.P. Choi, “Quick, Practical Selection of Effective Seeds for Homology Search,” J. Computational Biology, vol. 12, pp. 1137-1152, 2005.
[30]
G. Reinert, S. Schbath, and M. Waterman, “Probabilistic and Statistical Properties of Words: An Overview,” J. Computational Biology, vol. 7, pp. 1-46, 2000.
[31]
S.J. Schwager, “Run Probabilities in Sequences of Markov-Dependent Trials,” J. Am. Statistics Assoc., vol. 78, pp. 168-175, 1983.
[32]
S. Schwartz et al., “Human-Mouse Alignment with BLASTZ,” Genome Research, vol. 13, pp. 103-107, 2003.
[33]
A.D. Solov'ev, “A Combinatorial Identity and Its Application to the Problem Concerning the First Occurences of a Rare Event,” Theory of Probability and Application, vol. 11, pp. 276-282, 1966.
[34]
Y. Sun and J. Buhler, “Designing Multiple Simultaneous Seeds for DNA Similarity Search,” Proc. ACM-SIAM Int'l Conf. Research in Computational Molecular Biology (RECOMB '04), pp. 76-85, 2004.
[35]
I.-H. Yang, S.-H. Wang, Y.-H. Chen, P.-H. Huang, L. Ye, X. Huang, and K.-M. Chao, “Efficient Methods for Generating Optimal Single and Multiple Spaced Seeds,” Proc. Fourth IEEE Symp. Bioinformatics and Bioeng., pp. 411-418, 2004.

Cited By

View all
  • (2012)Alignment seeding strategies using contiguous pyrimidine purine matchesProceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine10.1145/2382936.2382985(384-391)Online publication date: 7-Oct-2012
  • (2010)Rapid sequence homology assessment by subsampling the genome space using difference setsIEEE Transactions on Information Theory10.1109/TIT.2009.203703656:2(756-770)Online publication date: 1-Feb-2010

Recommendations

Comments

Information & Contributors

Information

Published In

cover image IEEE/ACM Transactions on Computational Biology and Bioinformatics
IEEE/ACM Transactions on Computational Biology and Bioinformatics  Volume 4, Issue 3
July 2007
192 pages

Publisher

IEEE Computer Society Press

Washington, DC, United States

Publication History

Published: 01 July 2007
Published in TCBB Volume 4, Issue 3

Author Tags

  1. Homology search
  2. pattern matching
  3. renewal theory
  4. run statistics
  5. sequence alignment
  6. spaced seeds

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)3
  • Downloads (Last 6 weeks)0
Reflects downloads up to 13 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2012)Alignment seeding strategies using contiguous pyrimidine purine matchesProceedings of the ACM Conference on Bioinformatics, Computational Biology and Biomedicine10.1145/2382936.2382985(384-391)Online publication date: 7-Oct-2012
  • (2010)Rapid sequence homology assessment by subsampling the genome space using difference setsIEEE Transactions on Information Theory10.1109/TIT.2009.203703656:2(756-770)Online publication date: 1-Feb-2010

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media