ACM Home Page
Please provide us with feedback. Feedback
Mining periodic patterns with gap requirement from sequences
Full text PdfPdf (452 KB)
Source International Conference on Management of Data archive
Proceedings of the 2005 ACM SIGMOD international conference on Management of data table of contents
Baltimore, Maryland
SESSION: Research papers: stream and sequence mining table of contents
Pages: 623 - 633  
Year of Publication: 2005
ISBN:1-59593-060-4
Authors
Minghua Zhang  The University of Hong Kong, Hong Kong
Ben Kao  The University of Hong Kong, Hong Kong
David W. Cheung  The University of Hong Kong, Hong Kong
Kevin Y. Yip  The University of Hong Kong, Hong Kong
Sponsors
ACM: Association for Computing Machinery
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 68,   Citation Count: 4
Additional Information:

abstract   references   cited by   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1066157.1066228
What is a DOI?

ABSTRACT

We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem, and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403--410, 1990.
 
2
A. Bairoch and B. Boeckmann. The swiss-prot protein sequence data bank. Nucleic Acids Research, 20(Suppl):2019--2022, 1992.
 
3
Giorgio Bernardi, Birgitta Olofsson, Jan Filipski, Marino Zerial, Julio Salinas, Gerard Cuny, Michele Meunier-Rotival, and Francis Rodier. The mosaic genome of warm-blooded vertebrates. Science, 228(4702):953--958, 1985.
 
4
Eivind Coward and Finn Drablos. Detecting periodic patterns in biological sequences. Bioinformatics, 14(6):498--507, 1998.
 
5
J. W. Fickett and C. S. Tung. Assessment of protein coding measures. Nucleuic Acids Research, 20:6441--6450, 1992.
 
6
 
7
H. Herzel, O. Weiss, and E. N. Trifonov. 10--11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics, 15(3):187--193, 1999.
 
8
Inge Jonassen. Efficient discovery of conserved patterns using a pattern graph. Technical Report Report No. 118, University of Bergen, 1996.
 
9
 
10
 
11
 
12
13
 
14
P. S. Reddy and D. E. Housman. The complex pathology of trinucleotide repeats. Current Opinion in Cell Biology, 9(3):364--372, 1997.
 
15
Isidore Rigoutsos and Aris Floratos. Combinatorial pattern discovery in biological sequences: the teiresias algorithm. Bioinformatics, 14(1), 1998.
 
16
 
17
Masaru Tomita, Masahiko Wada, and Yukihiro Kawashima. ApA dinucleotide periodicity in prokaryote, eukaryote, and organelle genomes. Journal of Molecular Evolution, 49:182--192, 1999.
 
18
E. N. Trifonov. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A, 249:511--516, 1998.
 
19
A. van Belkum, S. Scherer amd W. van Leeuwen, D. Willemse, L. van Alphen, and H. Verbrugh. Variable number of tandem repeats in clinical strains of haemophilus influenzae. Infection and Immunity, 65(12):5017--5027, 1997.
 
20
J. Widom. Short-range order in two eukaryotic genomes: Relation to chromosome structure. Journal of Moleular Biology, 259:579--588, 1996.
21
22
 
23
Minghua Zhang, Ben Kao, David W. Cheung, and Kevin Y. Yip. Mining periodic patterns with gap requirement from sequences. Technical Report TR-2005-04, Department of Computer Science, The University of Hong Kong, 2005.
 
24
Minghua Zhang, Ben Kao, C. L. Yip, and David Cheung. A GSP-based efficient algorithm for mining frequent sequences. In Proc. of IC-AI'2001, Las Vegas, Nevada, USA, June 2001.

Collaborative Colleagues:
Minghua Zhang: colleagues
Ben Kao: colleagues
David W. Cheung: colleagues
Kevin Y. Yip: colleagues