Abstract
There has recently been considerable interest in applying data mining techniques to scientific and data analysis problems in bioinformatics, an emerging discipline that integrates the biological and information sciences. Novel application areas in this field are fueling data mining research and driving the development of new applied algorithms. This marks a paradigm shift from the earlier, and continuing, data mining efforts in marketing research and business intelligence support. The problem described in this paper lies along a new dimension of DNA sequence analysis research and supplements previously studied stochastic models of evolution and variability. Discovering novel patterns in genetic databases is significant because biological patterns play an important role in a large variety of cellular processes and constitute the basis for gene therapy. Biological databases containing the genetic codes of a wide variety of organisms, including humans, have continued to grow exponentially over the last decade. At the time of this writing, the GenBank database contains over 300 million sequences and over 2.5 billion characters of sequenced nucleotides. The focus of this paper is on developing a general data mining algorithm for discovering regions of locus control, i.e., those regions that are instrumental in activating genes. One type of locus control element is the MAR, or Matrix Attachment Region. Our limited knowledge of MARs has hampered their detection with classical pattern recognition techniques. Consequently, their detection is formulated in terms of a statistical interestingness measure derived from a set of empirical features known to be associated with MARs.
This paper presents a systematic approach for finding associations between such empirical features in genomic sequences, and for using this knowledge to detect biologically interesting control signals such as MARs. The computational MAR discovery tool is implemented as web-based software called MAR-Wiz and is available for public access. As our knowledge of living systems continues to evolve, and as biological databases continue to grow, a pattern learning methodology similar to the one described in this paper will be significant for detecting regulatory signals embedded in genomic sequences.
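The abstract does not spell out the interestingness measure, so the following is only a minimal sketch of the general idea of scoring sequence windows by how unlikely their feature counts are under a random model. Everything specific here is an assumption made for illustration and is not taken from the paper or from MAR-Wiz: the motif list `MOTIFS`, the uniform base composition, the Poisson approximation, and the function names `window_score` and `scan`.

```python
import math

# Hypothetical feature motifs, chosen only for illustration;
# they are NOT the empirical MAR features used in the paper.
MOTIFS = ["AATAAA", "TTTTTT", "ATATAT"]


def count_overlapping(seq, motif):
    """Count possibly overlapping occurrences of motif in seq."""
    return sum(1 for i in range(len(seq) - len(motif) + 1)
               if seq.startswith(motif, i))


def poisson_tail(k, lam, terms=200):
    """Approximate P(X >= k) for X ~ Poisson(lam) by summing upper-tail terms."""
    if k == 0:
        return 1.0
    # First term computed in log space to avoid overflow for large k.
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    for i in range(k, k + terms):
        total += term
        term *= lam / (i + 1)
    # Clamp so that the logarithm taken by the caller is always defined.
    return min(max(total, 1e-300), 1.0)


def window_score(window, motifs=MOTIFS):
    """Sum of -log10 tail probabilities of the observed motif counts.

    Assumes independent, uniformly distributed bases (an assumption).
    """
    score = 0.0
    for motif in motifs:
        positions = len(window) - len(motif) + 1
        if positions <= 0:
            continue
        expected = positions * 0.25 ** len(motif)
        observed = count_overlapping(window, motif)
        score += -math.log10(poisson_tail(observed, expected))
    return score


def scan(seq, window=100, step=10):
    """Slide a window along seq and return (start, score) pairs."""
    seq = seq.upper()
    last_start = max(len(seq) - window, 0)
    return [(start, window_score(seq[start:start + window]))
            for start in range(0, last_start + 1, step)]
```

Under these assumptions, a window with no motif occurrences scores zero, and windows whose motif counts are improbable by chance score higher; candidate regions would be those whose score exceeds some threshold, which this sketch leaves to the caller.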
Index Terms
- Discovering Matrix Attachment Regions (MARs) in genomic databases