|
ABSTRACT
Cluster analysis is a fundamental problem and technique in many areas related to machine learning. In this paper, we consider rearrangement clustering, which is the problem of finding sets of objects that share common or similar features by arranging the rows (objects) of a matrix (specifying object features) in such a way that adjacent objects are similar to each other (based on a similarity measure of the features) so as to maximize the overall similarity. Based on formulating this problem as the Traveling Salesman Problem (TSP), we develop a new TSP-based optimal clustering algorithm called TSPCluster. We overcome a flaw that is inherent in previous approaches by relaxing restrictions on dissimilarities between clusters. Our new algorithm has three important features: finding the optimal k clusters for a given k, automatically detecting cluster borders, and ascertaining a set of most viable clustering results that make good balances among maximizing the overall similarity within clusters and dissimilarity between clusters. We apply TSPCluster to cluster and display ~500 genes of flowering plant Arabidopsis which are regulated under various abiotic stress conditions. We compare TSPCluster to the bond energy algorithm and two existing clustering algorithms. Our TSPCluster code is available at (Climer & Zhang, 2004).
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
Applegate, D., Bixby, R., Chvátal, V., & Cook, W. (web). Concorde - A code for solving Traveling Salesman Problems. 15/12/99 Release, http:www.keck.caam.rice.edu/concorde.html.
|
| |
3
|
Arabie, P., & Hubert, L. (1990). The bond energy algorithm revisited. IEEE Trans. Systems, Man, and Cybernetics, 20, 268--74.
|
| |
4
|
|
| |
5
|
Baldi, P., & Hatfield, G. (2002). Dna microarrays and gene expression. Cambridge University Press.
|
| |
6
|
Bar-Joseph, Z., Demaine, E., Gifford, D., Hamel, A., Jaakkola, T., & Srebro, N. (2003). K-ary clustering with optimal leaf ordering for gene expression data. Bioinformatics, 19, 1070--8.
|
| |
7
|
Climer, S., & Zhang, W. (2004). Rearrangement clustering code. http://www.climer.us or http://www.cse.wustl.edu/~zhang/projects/software.
|
| |
8
|
Eisen, M., Spellman, P., Brown, P., & Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy Science, 95, 14863--8.
|
| |
9
|
|
| |
10
|
Jonker, R., & Volgenant, T. (1983). Transforming asymmetric into symmetric traveling salesman problems. Operations Research Letters, 2, 161--163.
|
| |
11
|
Kusiak, A. (1985). Flexible manufacturing systems: A structural approach. Int. J. Production Res., 23, 1057--73.
|
| |
12
|
Lenstra, J. (1974). Clustering a data array and the Traveling Salesman Problem. Operations Research, 22, 413--4.
|
| |
13
|
Lodi, A., & Punnen, A. P. (2002). TSP software. In G. Gutin and A. Punnen (Eds.), The traveling salesman problem and its variations. Norwell, MA: Kluwer Academic.
|
 |
14
|
|
| |
15
|
McCormick, W., Schweitzer, P., & White, T. (1972). Problem decomposition and data reorganization by a clustering technique. Operations Research, 20, 993--1009.
|
| |
16
|
Moscato, P. (web). TSPBIB. http:www.densis.fee.unicamp.br/~moscato/TSPBIB_home.html.
|
| |
17
|
Punnen, A. P. (2002). The Traveling Salesman Problem: applications, formulations, and variations. In G. Gutin and A. Punnen (Eds.), The traveling salesman problem and its variations. Norwell, MA: Kluwer Academic.
|
| |
18
|
Seki, M., Narusaka, M., Ishida, J., Nanjo, T., Oono, Y. F. M., Kamiya, A., Nakajima, M., Enju, A., Sakurai, T., Satou, M., Akiyama, K., Taji, T., Yamaguchi-Shinozaki, K., Carninci, P., Kawai, J., Hayashizaki, Y., & Shinozaki, K. (2002). Monitoring the expression profiles of 7000 arabidopsis genes under drought, cold and high-salinity stresses using a full-length cdna microarray. Plant Journal, 31, 279--92.
|
| |
19
|
Shamir, R., & Sharan, R. (2002). Algorithmic approaches to clustering gene expression data. Current Topics in Computational Biology (pp. 269--299). MIT Press.
|
|