DOI: 10.1145/1390334.1390368
research-article

Knowledge transformation from word space to document space

Published: 20 July 2008

ABSTRACT

In most IR clustering problems, we cluster the documents directly, working in the document space and using cosine similarity between documents as the similarity measure. In many real-world applications, however, we have knowledge on the word side and wish to transform this knowledge to the document (concept) side. In this paper, we provide a mechanism for this knowledge transformation. To the best of our knowledge, this is the first model for this type of knowledge transformation. The model uses a nonnegative matrix factorization X = FSG^T, where X is the word-document semantic matrix, F is the posterior probability of a word belonging to a word cluster and represents knowledge in the word space, G is the posterior probability of a document belonging to a document cluster and represents knowledge in the document space, and S is a scaled matrix factor that provides a condensed view of X. We show how knowledge on words can improve document clustering, i.e., how knowledge in the word space is transformed into the document space. We perform extensive experiments to validate our approach.
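To make the factorization X = FSG^T concrete, the following is a minimal sketch in Python with NumPy. It is not the paper's algorithm: it uses generic multiplicative updates for the unconstrained least-squares objective ||X - FSG^T||^2 and does not include the word-side prior knowledge that the paper transfers to documents. The function name `tri_factorize`, its parameters, and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

def tri_factorize(X, k_words, k_docs, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix X (words x documents) as F @ S @ G.T.

    Illustrative only: plain multiplicative updates for the squared
    Frobenius error, with no word-side prior knowledge.
    """
    rng = np.random.default_rng(seed)
    n_words, n_docs = X.shape
    F = rng.random((n_words, k_words)) + eps  # word-cluster factor
    S = rng.random((k_words, k_docs)) + eps   # condensed view of X
    G = rng.random((n_docs, k_docs)) + eps    # document-cluster factor
    errors = []
    for _ in range(n_iter):
        # Each factor is rescaled elementwise, so it stays nonnegative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T) ** 2)
    return F, S, G, errors

# Hypothetical toy word-document matrix: 6 words, 4 documents, two topics.
X = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [3, 3, 0, 0],
    [0, 0, 2, 3],
    [0, 0, 3, 2],
    [0, 0, 3, 3],
], dtype=float)

F, S, G, errors = tri_factorize(X, k_words=2, k_docs=2)

# Hard cluster assignments: the largest entry in each row of G (or F).
doc_labels = G.argmax(axis=1)
word_labels = F.argmax(axis=1)
print("document clusters:", doc_labels)
print("word clusters:", word_labels)
print("final squared error:", errors[-1])
```

In this sketch the rows of F and G are unnormalized; reading them as posterior probabilities, as the abstract does, would require normalizing each row to sum to one, which is left out here for brevity.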

        • Published in

          SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
          July 2008
          934 pages
          ISBN:9781605581644
          DOI:10.1145/1390334

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 792 of 3,983 submissions, 20%
