|
ABSTRACT
We study a problem of mining frequently occurring periodic patterns with a gap requirement from sequences. Given a character sequence S of length L and a pattern P of length l, we consider P a frequently occurring pattern in S if the probability of observing P given a randomly picked length-l subsequence of S exceeds a certain threshold. In many applications, particularly those related to bioinformatics, interesting patterns are periodic with a gap requirement. That is to say, the characters in P should match subsequences of S in such a way that the matching characters in S are separated by gaps of more or less the same size. We show the complexity of the mining problem and discuss why traditional mining algorithms are computationally infeasible. We propose practical algorithms for solving the problem, and study their characteristics. We also present a case study in which we apply our algorithms on some DNA sequences. We discuss some interesting patterns obtained from the case study.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Stephen F. Altschul, Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. Basic local alignment search tool. Journal of Molecular Biology, 215:403--410, 1990.
|
| |
2
|
A. Bairoch and B. Boeckmann. The swiss-prot protein sequence data bank. Nucleic Acids Research, 20(Suppl):2019--2022, 1992.
|
| |
3
|
Giorgio Bernardi, Birgitta Olofsson, Jan Filipski, Marino Zerial, Julio Salinas, Gerard Cuny, Michele Meunier-Rotival, and Francis Rodier. The mosaic genome of warm-blooded vertebrates. Science, 228(4702):953--958, 1985.
|
| |
4
|
Eivind Coward and Finn Drablos. Detecting periodic patterns in biological sequences. Bioinformatics, 14(6):498--507, 1998.
|
| |
5
|
J. W. Fickett and C. S. Tung. Assessment of protein coding measures. Nucleuic Acids Research, 20:6441--6450, 1992.
|
| |
6
|
|
| |
7
|
H. Herzel, O. Weiss, and E. N. Trifonov. 10--11 bp periodicities in complete genomes reflect protein structure and DNA folding. Bioinformatics, 15(3):187--193, 1999.
|
| |
8
|
Inge Jonassen. Efficient discovery of conserved patterns using a pattern graph. Technical Report Report No. 118, University of Bergen, 1996.
|
| |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
Jian Pei , Jiawei Han , Behzad Mortazavi-Asl , Helen Pinto , Qiming Chen , Umeshwar Dayal , Meichun Hsu, PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth, Proceedings of the 17th International Conference on Data Engineering, p.215-224, April 02-06, 2001
|
 |
13
|
Rakesh Agrawal , Tomasz Imieliński , Arun Swami, Mining association rules between sets of items in large databases, Proceedings of the 1993 ACM SIGMOD international conference on Management of data, p.207-216, May 25-28, 1993, Washington, D.C., United States
|
| |
14
|
P. S. Reddy and D. E. Housman. The complex pathology of trinucleotide repeats. Current Opinion in Cell Biology, 9(3):364--372, 1997.
|
| |
15
|
Isidore Rigoutsos and Aris Floratos. Combinatorial pattern discovery in biological sequences: the teiresias algorithm. Bioinformatics, 14(1), 1998.
|
| |
16
|
|
| |
17
|
Masaru Tomita, Masahiko Wada, and Yukihiro Kawashima. ApA dinucleotide periodicity in prokaryote, eukaryote, and organelle genomes. Journal of Molecular Evolution, 49:182--192, 1999.
|
| |
18
|
E. N. Trifonov. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Physica A, 249:511--516, 1998.
|
| |
19
|
A. van Belkum, S. Scherer amd W. van Leeuwen, D. Willemse, L. van Alphen, and H. Verbrugh. Variable number of tandem repeats in clinical strains of haemophilus influenzae. Infection and Immunity, 65(12):5017--5027, 1997.
|
| |
20
|
J. Widom. Short-range order in two eukaryotic genomes: Relation to chromosome structure. Journal of Moleular Biology, 259:579--588, 1996.
|
 |
21
|
Jiong Yang , Wei Wang , Philip S. Yu, Mining asynchronous periodic patterns in time series data, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, p.275-279, August 20-23, 2000, Boston, Massachusetts, United States
[doi> 10.1145/347090.347150]
|
 |
22
|
|
| |
23
|
Minghua Zhang, Ben Kao, David W. Cheung, and Kevin Y. Yip. Mining periodic patterns with gap requirement from sequences. Technical Report TR-2005-04, Department of Computer Science, The University of Hong Kong, 2005.
|
| |
24
|
Minghua Zhang, Ben Kao, C. L. Yip, and David Cheung. A GSP-based efficient algorithm for mining frequent sequences. In Proc. of IC-AI'2001, Las Vegas, Nevada, USA, June 2001.
|
|