Discovering Matrix Attachment Regions (MARs) in genomic databases

Author:
Gautam B. Singh

Oakland University, Rochester, MI

ACM SIGKDD Explorations Newsletter, Volume 1, Issue 2, January 2000, pp 39–45
https://doi.org/10.1145/846183.846184

Published: 01 January 2000

Abstract

Lately, there has been considerable interest in applying data mining techniques to scientific and data analysis problems in bioinformatics, an emerging discipline that integrates the biological and information sciences. Novel application areas in this field are fueling data mining research and driving the development of new applied algorithms. This is a paradigm shift from the earlier, and continuing, data mining efforts in marketing research and business intelligence support. The problem described in this paper lies along a new dimension of DNA sequence analysis research and supplements the previously studied stochastic models for evolution and variability. The discovery of novel patterns in genetic databases is significant because biological patterns play an important role in a wide variety of cellular processes and constitute the basis for gene therapy. Biological databases containing the genetic codes of a wide variety of organisms, including humans, have continued their exponential growth over the last decade. At the time of this writing, the GenBank database contains over 300 million sequences and over 2.5 billion characters of sequenced nucleotides. The focus of this paper is on developing a general data mining algorithm for discovering regions of locus control, i.e., those regions that are instrumental in activating genes. The MARs, or Matrix Attachment Regions (also called Matrix Association Regions), are one type of such locus control element. Our limited knowledge about MARs has hampered their detection using classical pattern recognition techniques. Consequently, their detection is formulated using a statistical interestingness measure derived from a set of empirical features that are known to be associated with MARs.
This paper presents a systematic approach for finding associations between such empirical features in genomic sequences, and for using this knowledge to detect biologically interesting control signals, such as MARs. This computational MAR discovery tool is implemented as web-based software called MAR-Wiz and is available for public access. As our knowledge of living systems continues to evolve, and as the biological databases continue to grow, a pattern learning methodology similar to that described in this paper will be significant for the detection of regulatory signals embedded in genomic sequences.
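
The abstract does not spell out the interestingness measure, so the following Python sketch only illustrates the general idea of scoring sequence windows by how unexpectedly often a set of features occurs. Everything specific in it is an assumption made for illustration: the two AT-rich patterns in `FEATURES`, the uniform base model, the Poisson null, the independence of features, and the names `poisson_tail`, `window_score`, and `scan`. It should not be read as the MAR-Wiz implementation.

```python
import math
import re

# Hypothetical features: (lookahead regex, chance that a match starts at any
# given position under a uniform, independent base model). These are
# illustrative stand-ins, not the feature set used in the paper.
FEATURES = {
    "at_run": (re.compile(r"(?=[AT]{6})"), 0.5 ** 6),
    "ta_repeat": (re.compile(r"(?=TATATA)"), 0.25 ** 6),
}


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    below = 0.0
    for i in range(k):
        below += term
        term *= lam / (i + 1)  # P(X = i + 1) derived from P(X = i)
    return max(1.0 - below, 1e-300)  # floor keeps the logarithm finite


def window_score(window):
    """Sum -log10 tail probabilities over all features (independence assumed)."""
    score = 0.0
    for pattern, p_start in FEATURES.values():
        # Zero-width lookahead matches count overlapping occurrences.
        observed = sum(1 for _ in pattern.finditer(window))
        expected = p_start * len(window)
        score += -math.log10(poisson_tail(observed, expected))
    return score


def scan(sequence, width=100, step=10):
    """Yield (start, score) for each sliding window over the sequence."""
    sequence = sequence.upper()
    for start in range(0, max(len(sequence) - width, 0) + 1, step):
        yield start, window_score(sequence[start:start + width])
```

Calling `dict(scan(seq))` on a sequence gives a score per window start; windows in which the assumed features occur far more often than chance would predict receive high scores, and windows with no occurrences score zero.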

Published in

ACM SIGKDD Explorations Newsletter Volume 1, Issue 2
January 2000
115 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/846183

Copyright © 2000 Author
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
DNA Sequence Analysis
MARs
Matrix Attachment Regions
bioinformatics
data mining
gene therapy
medical data mining