A matrix density based algorithm to hierarchically co-cluster documents and words
Source: International World Wide Web Conference
Proceedings of the 12th international conference on World Wide Web
Budapest, Hungary
SESSION: Data mining
Pages: 511 - 518  
Year of Publication: 2003
ISBN:1-58113-680-3
Authors
Bhushan Mandhani  Indian Institute of Technology, Bombay, India
Sachindra Joshi  IBM India Research Lab, New Delhi, India
Krishna Kummamuru  IBM India Research Lab, New Delhi, India
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
DOI: http://doi.acm.org/10.1145/775152.775225

ABSTRACT

This paper proposes an algorithm to hierarchically cluster documents. Each cluster consists of a cluster of documents and an associated cluster of words, and is thus a document-word co-cluster. Note that the vector model for documents yields the document-word matrix, of which every co-cluster is a submatrix. Intuitively, a submatrix made up of high values should be a good document cluster, with the corresponding word cluster containing its most distinctive features. Our algorithm exploits this intuition: we define matrix density, and the algorithm is driven by matrix density considerations.

The algorithm is partitional-agglomerative. The partitioning step identifies dense submatrices whose row sets partition the row set of the complete matrix. The hierarchical agglomerative step merges the most "similar" submatrices until the required number of clusters remains (for a flat clustering) or until only the single complete matrix is left (for a hierarchical arrangement of documents). The algorithm also generates apt labels for each cluster or hierarchy node. The similarity measure used for merging exploits the fact that the clusters are co-clusters, which is a key point of difference from existing agglomerative algorithms. We refer to the proposed algorithm as RPSA (Rowset Partitioning and Submatrix Agglomeration). We have compared it as a clustering algorithm with Spherical K-Means and Spectral Graph Partitioning, and have also evaluated some hierarchies generated by the algorithm.
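
The abstract does not state how matrix density or the merging similarity is defined, so the following is only an illustrative sketch of the general idea, not RPSA itself. It assumes that density is the mean entry of a submatrix and that two co-clusters are "similar" when the submatrix spanned by their union is dense. The names density, merge and agglomerate are hypothetical.

```python
from itertools import combinations


def density(matrix, rows, cols):
    """Mean entry of the submatrix rows x cols (an assumed definition)."""
    if not rows or not cols:
        return 0.0
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))


def merge(a, b):
    """Union of two co-clusters, each a (row_set, col_set) pair."""
    return (a[0] | b[0], a[1] | b[1])


def agglomerate(matrix, coclusters, k):
    """Greedily merge the pair whose union is densest until k remain.

    Using union density as the similarity is an assumption made for
    illustration; the paper's own measure is not given in the abstract.
    """
    clusters = list(coclusters)
    while len(clusters) > k:
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: density(matrix, *merge(clusters[p[0]], clusters[p[1]])),
        )
        merged = merge(clusters[i], clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters


# Toy document-word matrix: rows are documents, columns are words.
M = [
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
]

# Starting co-clusters: one document each, paired with its non-zero words.
# The row sets {0}, {1}, {2}, {3} partition the rows of M.
start = [
    (frozenset({0}), frozenset({0, 1})),
    (frozenset({1}), frozenset({0, 1})),
    (frozenset({2}), frozenset({2, 3})),
    (frozenset({3}), frozenset({2, 3})),
]

result = agglomerate(M, start, 2)
```

On this toy matrix the two merges join documents 0 and 1 over words 0 and 1, and documents 2 and 3 over words 2 and 3, because those unions have density 2.5 while every cross-block union has density 1.25. Continuing to k = 1 would give the single complete matrix, which corresponds to the root of a hierarchy.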



