ABSTRACT
An important part of deciphering gene regulatory mechanisms is discovering transcription factor binding sites. In many cases, these sites can be detected because they are often overrepresented in genomic sequences. The detection of the overrepresented signals in sequences, or motif-finding has become a central problem in computational biology. There are two major computational frameworks for attacking the motif finding problem which differ in their representation of the signals. The most popular is the profile or PSSM (Position Specific Scoring Matrix) representation. The goal of these algorithms is to obtain probabilistic representations of the overrepresented signals. Another is the consensus pattern or pattern with mismatches representation which represents a signal as discrete consensus pattern and allows some mismatches to occur in each instance of the pattern. The advantage of profiles is the expressiveness of their representation while the advantage of the consensus pattern approach is the existence of efficient algorithms that guarantee discovery of the best patterns. In this paper we present a unified framework for motif finding which encompasses both the profile representation and the consensus pattern representation. We prove that the problem of discovering the best profiles can be solved by considering a degenerate version of the problem of finding the best consensus patterns. The main advantage of our framework is that it motivates a novel algorithm, MITRA-PSSM, which discovers profiles, yet provides some of the guarantees of discovering the best signals. The algorithm searches for best profiles with respect to information content which is the same criterion of popular algorithms such as MEME and CONSENSUS. MITRA-PSSM is specifically designed for searching for profiles in this framework and introduces a novel notion of scoring consensus patterns, discrete information content. MITRA-PSSM is available for public use via webserver at http://www.calit2.net/compbio/mitra/.
- T. L. Bailey and C. Elkan. Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21:51, 1995.]] Google ScholarDigital Library
- Yoseph Barash, Gill Bejerano, and Nir Friedman. A simple hyper-geometric approach for discovering putative transcription factor binding sites. In Proceedings of WABI, 2001.]] Google ScholarDigital Library
- O.G. Berg and P. H. von Hippel. Selection of DNA binding sites by regulatory proteins. Trends Biochem Sci., 13:207--211, Jun 1998.]]Google ScholarCross Ref
- J. Buhler and M. Tompa. Finding motifs using random projections. In Proceedings of the Fifth Annual International Conference on Computational Molecular Biology (RECOMB01), pages 69--76, 2001.]] Google ScholarDigital Library
- A. P. Dempster, N. Laird, and D. Rubin. Maximum likelihood for incomplete data via the EM algorithm. Journal of the Royal Statistical Society, ser. B, 39:1--38, 1977.]]Google Scholar
- Alain Denise, Mireille Régnier, and Mathias Vandenbogaert. Assessing the statistical significance of overrepresented oligonucleotides. In Proceedings of WABI, 2001.]] Google ScholarDigital Library
- E. Eskin, U. Keich, M.S. Gelfand, and P.A. Pevzner. Genome wide analysis of bacterial promoter regions. In Proceedings of the 2003 Pacific Symposium on Biocomputing, 2003.]]Google Scholar
- E. Eskin and P. A. Pevzner. Finding composite regulatory patterns in dna sequences. In Special Issue Proceedings of the Tenth International Conference on Intelligent Systems for Molecular Biology (ISMB-2002) Bioinformatics., pages 1:S354--63, 2002.]]Google ScholarCross Ref
- D. J. Galas, M. Eggert, and M. S. Waterman. Rigorous pattern--recognition methods for DNA sequences. Journal of Molecular Biology, 186:117--128, 1985.]]Google ScholarCross Ref
- G. Hertz and G. Stormo. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15:563--577, 1999.]]Google ScholarCross Ref
- J.D. Hughes, P.W. Estep, S. Tavazoie, and G.M. Church. Computational identification of cis-regulatory elements associated with groups of functionally related genes in saccharomyces cerevisiae. Journal of Molecular Biology, 296(5):1205--14, 2000.]]Google ScholarCross Ref
- U. Keich and P. A. Pevzner. Finding motifs in the twilight zone. In Proceedings of the 6th International Conference on Computational Molecular Biology (RECOMB 2002), 2002.]] Google ScholarDigital Library
- C. E. Lawrence, S. F. Altschul, M. S. Boguski, J. S. Liu, A. F. Neuwald, and J. C. Wootton. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262:208--214, 1993.]]Google ScholarCross Ref
- A. Neuwald, J. Liu, and C. Lawrence. Gibbs motif sampling: Detection of bacterial outer membrane repeats. Protein Science, 4:1618--1632, 1995.]]Google ScholarCross Ref
- P. A. Pevzner and S. Sze. Combinatorial approaches to finding subtle signals in DNA sequences. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, pages 269--278, 2000.]] Google ScholarDigital Library
- M. Sagot. Spelling approximate or repeated motifs using a suffix tree. Lecture Notes in Computer Science, 1380:111--127, 1998.]] Google ScholarDigital Library
- E. Segal, Y. Barash, I. Simon, N. Friedman, and D. Koller. From promoter sequence to expression: A probabilistic framework. In Proceedings of the 6th International Conference on Research in Computational Molecular Biology (RECOMB), 2002.]] Google ScholarDigital Library
- S. Sinha and M. Tompa. Discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Research, 30:5549--5560, 2002.]]Google ScholarCross Ref
- S. Sze, M. S. Gelfand, and P. A. Pevzner. Finding subtle motifs in DNA sequences. In Proceedings of Pacific Symposium on Biocomputing, pages 235--246, 2002.]]Google Scholar
- M.S. Waterman, R. Arratia, and D.J. Galas. Pattern recognition in several sequences: consensus and alignm ent. Bulletin of Mathematical Biology, 46:515--527, 1984.]]Google ScholarCross Ref
Index Terms
- From profiles to patterns and back again: a branch and bound algorithm for finding near optimal motif profiles
Recommendations
Identification of specific sequence motifs in the upstream region of 242 human miRNA genes
We have identified novel over-represented and conserved motifs in the upstream regions of human and mouse miRNA stem-loop sequences by means of a new bioinformatic processing regimen. We observed sequence conservation -500bp upstream in 189 human and ...
Finding functional promoter motifs by computational methods: a word of caution
The standard practice in the analysis of promoters is to select promoter regions of convenient length. This may lead to false results when searching for Transcription Factor Binding Sites (TFBSs), since the sequences may contain coding segments. In such ...
Pathway to functional studies: pipeline linking phylogenetic footprinting and transcription-factor binding analysis
WISB '06: Proceedings of the 2006 workshop on Intelligent systems for bioinformatics - Volume 73Identification of transcription-factor binding sites is a critical first step in studying transcriptional regulation of genes. The comparative genomics method of phylogenetic footprinting is based on identifying sequence elements that are conserved ...
Comments