Article
DOI: 10.1145/1277741.1277760

Regularized clustering for documents

Published: 23 July 2007

Abstract

In recent years, document clustering has received increasing attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. In this paper, we propose a novel method for clustering documents using regularization. Unlike traditional globally regularized clustering methods, our method first constructs a local regularized linear label predictor for each document vector, and then combines all of those local regularizers with a global smoothness regularizer. We therefore call our algorithm Clustering with Local and Global Regularization (CLGR). We show that the cluster memberships of the documents can be obtained by eigenvalue decomposition of a sparse symmetric matrix, which can be solved efficiently by iterative methods. Finally, we present experimental evaluations on several datasets that show the superiority of CLGR over traditional document clustering methods.
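
The abstract does not give the CLGR objective itself, so the following is only a minimal sketch of the final step it describes: recovering cluster memberships from an eigenvector of a sparse symmetric matrix with an iterative solver. It assumes a plain graph Laplacian built from cosine similarities (a global smoothness term only) and omits the local regularized predictors entirely. The toy corpus, the 0.1 similarity threshold, and all function names are hypothetical choices for illustration, not the authors' method.

```python
import math

# Toy corpus (hypothetical): each document is a sparse term-count vector.
DOCS = [
    {"cluster": 2, "spectral": 1, "graph": 1},
    {"spectral": 2, "graph": 1, "eigenvector": 1},
    {"cluster": 1, "graph": 2, "eigenvector": 1, "text": 1},
    {"query": 2, "retrieval": 1, "index": 1, "text": 1},
    {"retrieval": 2, "index": 1, "ranking": 1},
    {"query": 1, "index": 2, "ranking": 1},
]


def cosine(a, b):
    """Cosine similarity between two sparse term-count dicts."""
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def sparse_laplacian(docs, threshold=0.1):
    """Graph Laplacian L = D - W, stored as one {column: value} dict per row."""
    n = len(docs)
    rows = [{} for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w = cosine(docs[i], docs[j])
            if w > threshold:  # drop weak links so the matrix stays sparse
                rows[i][j] = rows[j][i] = -w
    for i, row in enumerate(rows):
        row[i] = -sum(row.values())  # node degree goes on the diagonal
    return rows


def fiedler_vector(rows, iters=300):
    """Eigenvector of the second-smallest eigenvalue of L, by power iteration.

    Iterates on (shift * I - L) while projecting out the constant vector.
    Assumes at least two documents and at least one edge in the graph.
    """
    n = len(rows)
    # Gershgorin bound: no eigenvalue of L exceeds twice the largest degree.
    shift = 2.0 * max(row[i] for i, row in enumerate(rows))
    x = [float(i) for i in range(n)]  # fixed start vector, for reproducibility
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]  # remove the trivial constant eigenvector
        lx = [sum(v * x[j] for j, v in row.items()) for row in rows]
        x = [shift * a - b for a, b in zip(x, lx)]
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x


def cluster_two_way(docs):
    """Split documents into two clusters by the sign of the Fiedler vector."""
    return [0 if v < 0 else 1 for v in fiedler_vector(sparse_laplacian(docs))]
```

Calling `cluster_two_way(DOCS)` returns one label per document. Power iteration is used here only to keep the sketch dependency-free; for realistic collection sizes a sparse eigensolver such as `scipy.sparse.linalg.eigsh` would be the more usual choice, and more than two clusters would need several eigenvectors followed by a discretization step.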

Published In

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. document clustering
  2. regularization

Qualifiers

  • Article

Conference

SIGIR07
SIGIR07: The 30th Annual International SIGIR Conference
July 23 - 27, 2007
Amsterdam, The Netherlands

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 4
  • Downloads (Last 6 weeks): 1
Reflects downloads up to 17 Feb 2025

Cited By

  • (2023) Opinion Mining Using Optimized K-Means Algorithm and a Word Weighting Technique. SN Computer Science 4:6. DOI: 10.1007/s42979-023-02151-y. Online publication date: 27-Sep-2023
  • (2021) Local Learning Joint with the Adaptive Graph for Subspace Representation. 2021 16th International Conference on Intelligent Systems and Knowledge Engineering (ISKE), 207-214. DOI: 10.1109/ISKE54062.2021.9755411. Online publication date: 26-Nov-2021
  • (2019) Sequential Embedding Induced Text Clustering, a Non-parametric Bayesian Approach. Advances in Knowledge Discovery and Data Mining, 68-80. DOI: 10.1007/978-3-030-16142-2_6. Online publication date: 20-Mar-2019
  • (2015) VRCA. Proceedings of the 24th International Conference on Artificial Intelligence, 2355-2361. DOI: 10.5555/2832415.2832576. Online publication date: 25-Jul-2015
  • (2015) Discovering Latent Semantics in Web Documents Using Fuzzy Clustering. IEEE Transactions on Fuzzy Systems 23:6, 2122-2134. DOI: 10.1109/TFUZZ.2015.2403878. Online publication date: 1-Dec-2015
  • (2014) Cluster approach to the efficient use of multimedia resources in information warfare in wikimedia. Automatic Control and Computer Sciences 48:2, 97-108. DOI: 10.3103/S0146411614020023. Online publication date: 10-May-2014
  • (2014) Adaptive Centroid-Based Clustering Algorithm for Text Document Data. 2014 Sixth International Symposium on Parallel Architectures, Algorithms and Programming, 63-68. DOI: 10.1109/PAAP.2014.13. Online publication date: Jul-2014
  • (2013) Clustering tagged documents with labeled and unlabeled documents. Information Processing and Management: an International Journal 49:3, 596-606. DOI: 10.1016/j.ipm.2012.12.004. Online publication date: 1-May-2013
  • (2013) Towards graphical models for text processing. Knowledge and Information Systems 36:1, 1-21. DOI: 10.1007/s10115-012-0552-3. Online publication date: 1-Jul-2013
  • (2013) Enhancing Document Clustering Using Reweighting Terms Based on Semantic Features. Future Information Communication Technology and Applications, 257-264. DOI: 10.1007/978-94-007-6516-0_28. Online publication date: 25-May-2013