
Permu-pattern: discovery of mutable permutation patterns with proximity constraint

Authors:

Wei Su

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 318 - 326

https://doi.org/10.1145/1401890.1401932

Published: 24 August 2008

Abstract

Pattern discovery in sequences is an important problem in many applications, especially in computational biology and text mining. However, because the data are noisy, the traditional sequential pattern model may fail to reflect the underlying characteristics of sequence data in these applications. There are two challenges. First, mutation noise exists in the data, so symbols may be misrepresented by other symbols. Second, the order of symbols in sequences may be permuted. To address these problems, we propose a new sequential pattern model called mutable permutation patterns. Since the Apriori property does not hold for our permutation pattern model, we devise a novel Permu-pattern algorithm to mine frequent mutable permutation patterns from sequence databases, and we identify a reachability property to prune the candidate set. We apply the permutation pattern model to a real genome dataset to discover gene clusters, which shows the effectiveness of the model, and we use a large amount of synthetic data to demonstrate the efficiency of the Permu-pattern algorithm.
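To make the two kinds of noise named in the abstract concrete, the following is a minimal illustrative sketch, not the Permu-pattern algorithm and not the paper's formal definitions, which the abstract does not give. It assumes, purely for illustration, that a pattern occurs in a sequence when all of its symbols appear in any order inside some window of at most `max_span` positions, and that mutation is modelled by a hypothetical `mutations` table mapping a symbol to the symbol it may stand in for. The names `canonical`, `occurs`, `support`, `mutations`, and `max_span` are all assumptions introduced here.

```python
from collections import Counter


def canonical(symbol, mutations):
    """Map a possibly mutated symbol to its representative (assumed model)."""
    return mutations.get(symbol, symbol)


def occurs(sequence, pattern, mutations, max_span):
    """Return True if some window of at most max_span symbols contains
    every pattern symbol, in any order, after undoing assumed mutations."""
    need = Counter(canonical(s, mutations) for s in pattern)
    last_start = max(len(sequence) - max_span, 0)
    for start in range(last_start + 1):
        window = sequence[start:start + max_span]
        have = Counter(canonical(s, mutations) for s in window)
        # Counter subtraction keeps only positive counts, so an empty
        # result means the window covers the whole pattern multiset.
        if not (need - have):
            return True
    return False


def support(database, pattern, mutations, max_span):
    """Count the sequences in the database in which the pattern occurs."""
    return sum(occurs(seq, pattern, mutations, max_span) for seq in database)


# Hypothetical toy data: 'E' is treated as a mutated form of 'B'.
database = ["DCABX", "AECD", "AXXBXXC"]
mutations = {"E": "B"}
count = support(database, "ABC", mutations, max_span=3)
```

Under these assumptions the first sequence matches through reordering (`CAB`), the second through mutation (`AEC`), and the third fails only because its symbols are spread over more than three positions.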


Index Terms

Permu-pattern: discovery of mutable permutation patterns with proximity constraint
1. Computing methodologies
  1. Machine learning



Published In

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2008

1116 pages

ISBN: 9781605581934

DOI: 10.1145/1401890

General Chair: Ying Li (Microsoft adCenter Labs)

Program Chairs: Bing Liu (University of Illinois at Chicago), Sunita Sarawagi (Indian Institute of Technology, Bombay)

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article

Conference

KDD08: The 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2008

Las Vegas, Nevada, USA

Acceptance Rates

KDD '08 Paper Acceptance Rate 118 of 593 submissions, 20%;

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


Cited By

Pejaver V, Lee H, Kim S (2011). Gene Cluster Prediction and Its Application to Genome Annotation. Protein Function Prediction for Omics Era, pp. 35-54. Online publication date: 29-Mar-2011.
https://doi.org/10.1007/978-94-007-0881-5_3

Leung C, Brajczuk D (2010). Efficient algorithms for the mining of constrained frequent patterns from uncertain data. ACM SIGKDD Explorations Newsletter 11:2, pp. 123-130. Online publication date: 27-May-2010.
https://dl.acm.org/doi/10.1145/1809400.1809425

Sheng C, Hsu W, Lee M, Tong J, Ng S (2010). Mining mutation chains in biological sequences. 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 473-484. Online publication date: Mar-2010.
https://doi.org/10.1109/ICDE.2010.5447869

Leung C, Brajczuk D (2009). Efficient algorithms for mining constrained frequent patterns from uncertain data. Proceedings of the 1st ACM SIGKDD Workshop on Knowledge Discovery from Uncertain Data, pp. 9-18. Online publication date: 28-Jun-2009.
https://dl.acm.org/doi/10.1145/1610555.1610557
