ABSTRACT
In most IR clustering problems, we directly cluster the documents, working in the document space, using cosine similarity between documents as the similarity measure. In many real-world applications, however, we usually have knowledge on the word side and wish to transform this knowledge to the document (concept) side. In this paper, we provide a mechanism for this knowledge transformation. To the best of our knowledge, this is the first model for such type of knowledge transformation. This model uses a nonnegative matrix factorization model X = FSGT, where X is the word document semantic matrix, F is the posterior probability of a word belonging to a word cluster and represents knowledge in the word space, G is the posterior probability of a document belonging to a document cluster and represents knowledge in the document space, and S is a scaled matrix factor which provides a condensed view of X. We show how knowledge on words can improve document clustering, i.e, knowledge in the word space is transformed into the document space. We perform extensive experiments to validate our approach.
- R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley Longman, 1999. Google ScholarDigital Library
- S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. In Proceedings of ACM SIGKDD, pages 59--68, 2004. Google ScholarDigital Library
- P. Berkhin. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.Google Scholar
- M. Bilenko, S. Basu, and R. Mooney. Integrating constraints and metric learning in semi-supervised clustering. Proc. Int'l Conf. Machine Learning (ICML2004), 2004. Google ScholarDigital Library
- H. Cho, I. Dhillon, Y. Guan, and S. Sra. Minimum sum squared residue co-clustering of gene expression data. In Proceedings of The 4th SIAM Data Mining Conference, pages 22--24, April 2004.Google ScholarCross Ref
- D. Cohn, R. Caruana, and A. McCallum. Semi-supervised clustering with user feedback. Technical Report TR2003-1892, Cornell University, 2003.Google Scholar
- I. Davidson and S. Ravi. Clustering under constraints: Feasibility results and the k-means algorithm. In Proceedings of SIAM Data Mining Conference, 2005.Google Scholar
- I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. Proceeding of ACM SIGKDD, 2001. Google ScholarDigital Library
- I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretical co-clustering. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'03), pages 89--98, 2003. Google ScholarDigital Library
- C. Ding and X. He. K-means clustering and principal component analysis. Int'l Conf. Machine Learning (ICML), 2004. Google ScholarDigital Library
- C. Ding, T. Li, W. Peng, and H. Park. Orthogonal nonnegative matrix tri-factorizations for clustering. In Proceedings of ACM SIGKDD, pages 126--135, 2006. Google ScholarDigital Library
- J. Hartigan. Clustering Algorithms. Wiley, 1975. Google ScholarDigital Library
- T. Hofmann. Probabilistic latent semantic indexing. Proc. ACM Conf. on Research and Develop. IR (SIGIR), pages 50--57, 1999. Google ScholarDigital Library
- A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988. Google ScholarDigital Library
- Z. Kou and C. Zhang. Reply networks on a bulletin board system. Phys. Rev. E, (67), 2003.Google Scholar
- D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems 13, Cambridge, MA, 2001. MIT Press.Google Scholar
- T. Li. A general model for clustering binary data. In KDD, pages 188--197, 2005. Google ScholarDigital Library
- T. Li and C. Ding. The relationships among various nonnegative matrix factorization methods for clustering. In Proceedings of the 2006 IEEE International Conference on Data Mining (ICDM 2006), pages 362--371, 2006. Google ScholarDigital Library
- T. Li, C. Ding, and M. Jordan. Solving consensus and semi-supervised clustering problems using nonnegative matrix factorization. In Proceedings of the 2007 IEEE International Conference on Data Mining (ICDM 2007), pages 577--582, 2007. Google ScholarDigital Library
- B. Long, X. Wu, Z. M. Zhang, and P. S. Yu. Unsupervised learning on k-partite graphs. In Proceedings of ACM SIGKDD, pages 317--326, 2006. Google ScholarDigital Library
- J. Nocedal and S. J. Wright. Numerical Optimization. Springer-Verlag, 1999.Google ScholarCross Ref
- N. Slonim and N. Tishby. Document clustering using word clusters via the information bottleneck method. In SIGIR, pages 208--215, 2000. Google ScholarDigital Library
- A. Strehl and J. Ghosh. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583--617, March 2003. Google ScholarDigital Library
- K. Wagstaff, C. Cardie, S. Rogers, and S. Schroedl. Constrained k-means clustering with background knowledge. ICML, 2001. Google ScholarDigital Library
- F. Wang, T. Li, and C. Zhang. Semi-supervised learning via matrix factorization. In Proceedings of 2008 SIAM International Conference on Data Mining, 2008.Google ScholarCross Ref
- E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. NIPS, 2002.Google Scholar
- H. Zha, C. Ding, M. Gu, X. He, and H. Simon. Spectral relaxation for K-means clustering. NIPS, pages 1057--1064, 2002.Google Scholar
- H. Zha, X. He, C. Ding, M. Gu, and H. Simon. Bipartite graph partitioning and data clustering. CIKM, 2001. Google ScholarDigital Library
Index Terms
- Knowledge transformation from word space to document space
Recommendations
Text document clustering based on frequent word meaning sequences
Most of existing text clustering algorithms use the vector space model, which treats documents as bags of words. Thus, word sequences in the documents are ignored, while the meaning of natural languages strongly depends on them. In this paper, we ...
Word Space: A New Approach to Describe Word Meanings
CIT '06: Proceedings of the Sixth IEEE International Conference on Computer and Information TechnologyThe purpose of this research is to acquire new knowledge about the meanings of words by arranging the words in space. We allocated the words and displayed their relationship with each other in a three-dimensional space with a method called Associated ...
Text document clustering based on frequent word sequences
CIKM '05: Proceedings of the 14th ACM international conference on Information and knowledge managementIn this paper, we propose a new text clustering algorithm, named Clustering based on Frequent Word Sequences (CFWS). A word sequence is frequent if it occurs in more than certain percentage of the documents in the text database. In the past, the vector ...
Comments