|
ABSTRACT
This paper proposes an algorithm to hierarchically cluster documents. Each cluster is actually a cluster of documents and an associated cluster of words, thus a document-word co-cluster. Note that, the vector model for documents creates the document-word matrix, of which every co-cluster is a submatrix. One would intuitively expect a submatrix made up of high values to be a good document cluster, with the corresponding word cluster containing its most distinctive features. Our algorithm looks to exploit this. We have defined matrix density, and our algorithm basically uses matrix density considerations in its working.The algorithm is a partitional-agglomerative algorithm. The partitioning step involves the identification of dense submatrices so that the respective row sets partition the row set of the complete matrix. The hierarchical agglomerative step involves merging the most "similar" submatrices until we are down to the required number of clusters (if we want a flat clustering) or until we have just the single complete matrix left (if we are interested in a hierarchical arrangement of documents). It also generates apt labels for each cluster or hierarchy node. The similarity measure between clusters that we use here for the merging cleverly uses the fact that the clusters here are co-clusters, and is a key point of difference from existing agglomerative algorithms. We will refer to the proposed algorithm as RPSA (Rowset Partitioning and Submatrix Agglomeration). We have compared it as a clustering algorithm with Spherical K-Means and Spectral Graph Partitioning. We have also evaluated some hierarchies generated by the algorithm.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
 |
3
|
Hongyuan Zha , Xiaofeng He , Chris Ding , Horst Simon , Ming Gu, Bipartite graph partitioning and data clustering, Proceedings of the tenth international conference on Information and knowledge management, October 05-10, 2001, Atlanta, Georgia, USA
[doi> 10.1145/502585.502591]
|
| |
4
|
Inderjit S. Dhillon, James Fan, and Yuqiang Guan, "Efficient clustering of very large document collections," in Data Mining for Scientific and Engineering Applications, R. Grossman, G. Kamath, and R. Naburu, Eds. Kluwer Academic Publ., 2001.
|
| |
5
|
Inderjit S. Dhillon and Dharmendra S. Modha, "Concept decompositions for large sparse text data using clustering," Tech. Rep. RJ 10147, IBM Almaden Research Center, August 1999.
|
| |
6
|
Michael Steinbach, George Karypis, and Vipin Kumar, "A comparison of document clustering techniques," in KDD Workshop on Text Mining, 2000.
|
| |
7
|
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
Shigeru Oyanagi, Kazuto Kubota, and Akihiko Nakase, "Application of matrix clustering to web log analysis and access prediction," in WEBKDD, August 2001.
|
 |
12
|
|
 |
13
|
|
CITED BY 6
|
|
|
|
Jiangang Ma , Yanchun Zhang , Jing He, Efficiently finding web services using a clustering semantic approach, Proceedings of the 2008 international workshop on Context enabled source and service selection, integration and adaptation: organized with the 17th International World Wide Web Conference (WWW 2008), p.1-8, April 22-22, 2008, Beijing, China
|
|
Krishna Kummamuru , Rohit Lotlikar , Shourya Roy , Karan Singal , Raghu Krishnapuram, A hierarchical monothetic document clustering algorithm for summarization and browsing search results, Proceedings of the 13th international conference on World Wide Web, May 17-20, 2004, New York, NY, USA
|
|
|
|
|
|
|
|
|
|
|