ABSTRACT
Computationally identifying transcription factor binding sites in the promoter regions of genes is an important problem in computational biology and has been under intensive research for a decade. To predict the binding site locations efficiently, many algorithms that incorporate either approximate or heuristic techniques have been developed. However, the prediction accuracy is not satisfactory and binding site prediction thus remains a challenging problem. In this paper, we develop an approach that can be used to predict binding site motifs using a genetic algorithm. Based on the generic framework of a genetic algorithm, the approach explores the search space of all possible starting locations of the binding site motifs in different target sequences with a population that undergoes evolution. Individuals in the population compete to participate in the crossovers and mutations occur with a certain probability. Initial experiments demonstrated that our approach could achieve high prediction accuracy in a small amount of computation time. A promising advantage of our approach is the fact that the computation time does not explicitly depend on the length of target sequences and hence may not increase significantly when the target sequences become very long.
- Bailey, T.L. and Elkan, C. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, Stanford, CA, AAAI Press, Bethesda, MD, 1994, 28--36.Google Scholar
- Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: A sequence logo generator, Genome Research, 14, 6 (June 2004), 1188--1190.Google ScholarCross Ref
- Eiben, A.E. and Smith, J.E. Introduction to Evolutionary Computing. Springer-Verlag, New York. 2003. Google ScholarDigital Library
- Galas, D.J. and Schmitz, A. A DNA footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res., 5, 9 (Sep. 1978), 3157--3170.Google ScholarCross Ref
- Garner, M. M., and A. Revzin. A gel electrophoresis method for quantifying the binding of proteins to specific DNA regions: application to components of the Escherichia coli lactose operon regulatory system. Nucleic Acids Res. 9, 13 (July, 1981), 3047--3060.Google ScholarCross Ref
- Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N., Macisaac, K.D., Danford, T.D., Hannett, N.M., Tagne, J.-B., Reynolds, D.B., Yoo, J., Jennings, E.G., Zeitlinger, J., Pokholok, D.K., Kellis, M., Rolfe, P.A., Takusagawa, K.T., Lander, E.S., Gifford, D.K., Fraenkel, E. and Young, R.A. Transcriptional Regulatory Code of a Eukaryotic Genome. Nature, 431, 7004 (Sep. 2004), 99--104.Google ScholarCross Ref
- Hertz, G.Z. and Stormo, G.D. Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics, 15, 7 (July, 1999), 563--577.Google ScholarCross Ref
- Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wooton, J.C. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science, 262, 8 (Oct. 1993), 208--214.Google ScholarCross Ref
- Liu, F.F.M., Tsai, J.J.P., Chen, R.M., Chen, S.N. and Shih, S.H. FMGA: finding motifs by genetic algorithm. IEEE Fourth Symposium on Bioinformatics and Bioengineering (BIBE 2004), May 2004, 459--466. Google ScholarDigital Library
- Liu, J.S., Neuwald, A.F. and Lawrence, C.E. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc., 90, 432 (Nov. 1995), 1156--1170.Google ScholarCross Ref
- Liu, X., Brutlag, D.L. and Liu, J.S. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput., 6, 2001, 127--138.Google Scholar
- Notredame, C. and Higgins, D.G. SAGA: Sequence alignment by genetic algorithm. Nucleic Acids Res. 24, 8 (Apr. 1996), 1515--1524Google ScholarCross Ref
- Roth, F.R., Hughes, J. D., Estep, P. E. and Church G. M. Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-Genome mRNA quantitation. Nature Biotechnology 16, 10 (Oct. 1998), 939--45.Google ScholarCross Ref
- Stormo, G.D. Computer methods for analyzing sequence recognition of nucleic acids. Annu. Rev. BioChem. 17, 1988, 241--263.Google Scholar
- Stormo,G.D. and Hartzell, G.W. Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA, 86, 4, (Feb. 1989), 1183--1187.Google ScholarCross Ref
Index Terms
- MDGA: motif discovery using a genetic algorithm
Recommendations
Finding motifs for insufficient number of sequences with strong binding to transcription facto
RECOMB '04: Proceedings of the eighth annual international conference on Research in computational molecular biologyFinding motifs is an important problem in computational biology. Our paper makes two major contributions to this problem. Firstly, we better characterize the types of problem instances that cannot be solved by most existing methods of finding motifs. ...
Assessment of clustering algorithms for unsupervised transcription factor binding site discovery
Identification of transcription factor binding sites is a key task to understand gene regulation mechanism to discover gene networks and functions. Clustering approach is proved to be useful when finding such patterns residing in promoter regions of co-...
Algorithms for phylogenetic footprinting
RECOMB '01: Proceedings of the fifth annual international conference on Computational biologyPhylogenetic footprinting is a technique that identifies regulatory elements by finding unusually well conserved regions in a set of orthologous non-coding DNA sequences from multiple species. In an earlier paper, we presented an exact algorithm that ...
Comments