
Discovering Matrix Attachment Regions (MARs) in genomic databases

Published: 01 January 2000

Abstract

There has recently been considerable interest in applying data mining techniques to scientific and data analysis problems in bioinformatics, an emerging discipline that integrates the biological and information sciences. These novel application areas are driving the development of new applied algorithms, a shift in paradigm from earlier and continuing data mining efforts in marketing research and business intelligence support. The problem described in this paper opens a new dimension in DNA sequence analysis research and supplements the previously studied stochastic models for evolution and variability. The discovery of novel patterns in genetic databases is significant because biological patterns play an important role in a large variety of cellular processes and constitute the basis for gene therapy. Biological databases containing the genetic codes of a wide variety of organisms, including humans, have continued to grow exponentially over the last decade. At the time of this writing, the GenBank database contains over 300 million sequences and over 2.5 billion characters of sequenced nucleotides. The focus of this paper is on developing a general data mining algorithm for discovering regions of locus control, i.e., those regions that are instrumental in activating genes. One such type of locus control element is the MAR, or Matrix Association Region. Our limited knowledge about MARs has hampered their detection using classical pattern recognition techniques. Consequently, their detection is formulated using a statistical interestingness measure derived from a set of empirical features that are known to be associated with MARs.
This paper presents a systematic approach for finding associations between such empirical features in genomic sequences, and for using this knowledge to detect biologically interesting control signals, such as MARs. This computational MAR discovery tool is implemented as a web-based application called MAR-Wiz and is available for public access. As our knowledge about living systems continues to evolve, and as the biological databases continue to grow, a pattern learning methodology similar to the one described in this paper will be significant for the detection of regulatory signals embedded in genomic sequences.
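
To make the idea of a feature-based interestingness measure concrete, the following is a minimal Python sketch, not the MAR-Wiz implementation. It assumes, purely for illustration, that each empirical feature is a short sequence motif, that motif counts in a sliding window follow a Poisson null model whose rate is estimated from the whole sequence, and that a window's score is the summed negative log-probability of seeing at least the observed counts by chance. The motif patterns, the window and step sizes, and the names `MOTIFS`, `poisson_tail`, and `window_scores` are all hypothetical.

```python
import math
import re

# Hypothetical motif set (an assumption, not the features used by MAR-Wiz).
# Zero-width lookaheads let finditer report overlapping occurrences.
MOTIFS = {
    "at_rich": re.compile(r"(?=[AT]{6})"),
    "tg_repeat": re.compile(r"(?=(?:TG){3})"),
}

def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), summing the upper tail directly."""
    if k == 0:
        return 1.0
    if lam <= 0.0:
        return 1e-300  # floor: an occurrence under a zero-rate null model
    total, i = 0.0, k
    while True:
        # Work in log space so large counts do not overflow.
        term = math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        total += term
        if i > lam and term <= total * 1e-12:
            break
        i += 1
    return min(max(total, 1e-300), 1.0)

def window_scores(seq, window=100, step=50):
    """Score each window by how surprising its motif counts are under the null."""
    seq = seq.upper()
    hits = {name: [m.start() for m in pat.finditer(seq)]
            for name, pat in MOTIFS.items()}
    # Background rate per position, estimated from the full sequence (assumption).
    rates = {name: len(pos) / max(len(seq), 1) for name, pos in hits.items()}
    scores = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        end = start + window
        score = 0.0
        for name, pos in hits.items():
            k = sum(start <= p < end for p in pos)
            score += -math.log(poisson_tail(k, rates[name] * window))
        scores.append((start, score))
    return scores

# Usage: a synthetic sequence with an AT-rich block between GC-rich flanks.
seq = "GC" * 100 + "AT" * 50 + "GC" * 100
scores = window_scores(seq)
best_start, best_score = max(scores, key=lambda s: s[1])
```

In this toy input the highest-scoring window is the one that covers the AT-rich block, while windows with no motif occurrences score zero. Summing the per-feature terms treats the features as independent, which is a simplifying assumption of the sketch.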

