ABSTRACT
In this paper, we propose a novel document clustering method based on non-negative matrix factorization (NMF) of the term-document matrix of a given document corpus. In the latent semantic space derived by NMF, each axis captures the base topic of a particular document cluster, and each document is represented as an additive combination of the base topics. The cluster membership of each document can be determined simply by finding the base topic (the axis) on which the document has the largest projection value. Our experimental evaluations show that the proposed method surpasses latent semantic indexing and spectral clustering methods not only in the ease and reliability of deriving clustering results, but also in clustering accuracy.
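The clustering rule described in the abstract can be illustrated with a minimal sketch. It assumes a term-document matrix `X` with terms as rows and documents as columns, factorized as `X ≈ U Vᵀ` using the standard multiplicative updates for the Frobenius objective; the abstract does not state which update rule, normalization, or term weighting the paper uses, so those choices, the function name `nmf_cluster`, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np


def nmf_cluster(X, k, n_iter=200, eps=1e-9, seed=0):
    """Cluster the columns (documents) of a non-negative matrix X into k groups.

    Assumed setup (not taken from the abstract): X is terms x documents and is
    factorized as X ~ U @ V.T with multiplicative updates that minimize the
    squared Frobenius error.
    """
    rng = np.random.default_rng(seed)
    m, n = X.shape
    U = rng.random((m, k)) + eps  # term-by-topic matrix: the k topic axes
    V = rng.random((n, k)) + eps  # document-by-topic matrix: projections

    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U) / (V @ (U.T @ U) + eps)

    # Rescale each topic axis to unit length and compensate in V, so that
    # projection values are comparable across topics (an assumed convention).
    norms = np.linalg.norm(U, axis=0) + eps
    U /= norms
    V *= norms

    # Each document joins the topic on which its projection is largest.
    labels = V.argmax(axis=1)
    return labels, U, V


# Toy corpus: documents 0-1 use only terms 0-2, documents 2-3 only terms 3-5.
X = np.array(
    [
        [2, 1, 0, 0],
        [1, 2, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 2, 1],
        [0, 0, 1, 2],
        [0, 0, 1, 1],
    ],
    dtype=float,
)

labels, U, V = nmf_cluster(X, k=2)
print(labels)
```

Because `U` and `V` stay non-negative, each row of `V` reads directly as a document's additive mix of topics, so no separate clustering step is needed on top of the factorization. The cluster indices themselves are arbitrary and depend on the random initialization.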
Index Terms
- Document clustering based on non-negative matrix factorization