skip to main content
10.1145/1008992.1009031acmconferencesArticle/Chapter ViewAbstractPublication PagesirConference Proceedingsconference-collections
Article

Document clustering via adaptive subspace iteration

Published: 25 July 2004 Publication History

Abstract

Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm ASI1, which uses explicitly modeling of the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated from the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existential clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of ASI algorithm.

References

[1]
Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms for projected clustering. ACM SIGMOD Conference (pp. 61--72).]]
[2]
Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Conference (pp. 94--105).]]
[3]
Anderberg, M. R. (1973). Cluster analysis for applications. Academic Press Inc.]]
[4]
Berger, M., & Rigoutsos, I. (1991). An algorithm for point clustering and grid generation. IEEE Trans. on Systems, Man and Cybernetics, 21, 1278--1286.]]
[5]
Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is nearest neighbor meaningful? Proceedings of 7th International Conference on Database Theory(ICDT'99) (pp. 217--235).]]
[6]
Bock, H.-H. (1989). Probabilistic aspects in cluster analysis. In O. Opitz (Ed.), Conceptual and numerical analysis of data, 12--44. Berlin: Springer-verlag.]]
[7]
Boley, D., Gini, M., Gross, R., Han, E.-H., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1999). Document categorization and query generation on the world wide web using webace. AI Review, 13, 365--391.]]
[8]
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D. (1988). Autoclass: a Bayesian classification system. Proceedings of the Fifteenth International Conference on Machine Learning(ICML'88).]]
[9]
Cheng, Y., & Church, G. M. Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB) (pp. 93--103).]]
[10]
Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley and Sons.]]
[11]
Deuflhard, P., Huisinga, W., Fischer, A., & Schutte, C. (2000). Identification of almost invariant aggregates in reversible nearly coupled markov chain. Linear Algebra and Its Applications, 315,39--59.]]
[12]
Dhillon, I. (2001). Co-clustering documents and words using bipartite spectral graph partitioning (Technical Report 2001-05). Department of Computer Science, University of Texas at Austin.]]
[13]
Dhillon, I. S., Mallela, S., & Modha, S. S. (2003). Information-theoretic co-clustering. ACM SIGKDD Conference (pp. 89--98).]]
[14]
Ding, C., He, X., Zha, H., & Simon, H. (2002). Adaptive dimension reduction for clustering high dimensional data. IEEE International Conference on Data Mining(ICDM 2002) (pp. 107--114).]]
[15]
Domeniconi, C., Peng, J., & Gunopulos, D. (2002). Locally adaptive metric nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1281--1285.]]
[16]
Globerson, A., & Tishby, N. (2003). Sufficient dimensionality reduction. J.Mach.Learn.Res., 3, 1307--1331.]]
[17]
Golub, G. H., & Loan, C. F. V. (1991). Matrix computations. The Johns Hopkins University Press.]]
[18]
Govaert, G. (1985). Simultaneous clustering of rows and columns. Control and Cybernetics, 437--458.]]
[19]
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Conference (pp. 73--84).]]
[20]
Hagen, L., & Kahng, A. B. (1992). New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Computer-Aided Design, 11, 1074--1085.]]
[21]
Han, E.-H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1998). WebACE: A web agent for document categorization and exploration. Proceedings of the 2nd International Conference on Autonomous Agents (Agents'98) (pp. 408--415).]]
[22]
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.]]
[23]
Hartigan, J. (1975). Clustering algorithms. Wiley.]]
[24]
Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, prediction. Springer.]]
[25]
Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.]]
[26]
Johnson, R., & Wichern, D. (1998). Applied multivariate statistical analysis. New York: Prentice-Hall.]]
[27]
Kato, L. (1995). Perturbation theory for linear operators. Springer.]]
[28]
Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788--791.]]
[29]
Leung, Y., she Zhang, J., & Xu, Z.-B. (2000). Clustering by scale-space filtering. IEEE Transactions on pattern analysis and machine intelligence, 22, 1396--1410.]]
[30]
Li, T., & Ma, S. (2004). IFD:iterative feature and data clustering. Proceedings of the 2004 SIAM International conference on Data Mining (SDM 2004).]]
[31]
Li, T., Zhu, S., & Ogihara, M. (2003). Efficient multi-way text categorization via generalized discriminant analysis. Proceedings of the Twelfth Conference on Information and Knowledge Management(CIKM 2003) (pp. 317--324).]]
[32]
Linde, Y., Buzo, A., & Gray, R. (1980). An algorithm for vector quantization design. IEEE Transactions on Communications, 28,84--95.]]
[33]
McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/ mccallum/bow.]]
[34]
Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14 (NIPS'01).]]
[35]
Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto: University of Toronto Press.]]
[36]
Perona, P., & Freeman, W. (1998). A factorization approach to grouping. Lecture Notes in Computer Science, 1406, 655--670.]]
[37]
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888--905.]]
[38]
Slonim, N., & Tishby, N. (1999). Agglomerative information bottleneck. Advances in Neural Information Processing Systems 12 (NIPS'99).]]
[39]
Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. ACM SIGIR 2000 (pp. 208--215).]]
[40]
Spielman, D. A., & Teng, S.-H. (1996). Spectral partitioning works: Planar graphs and finite element meshes. In IEEE Symposium on Foundations of Computer Science (pp. 96--105).]]
[41]
Tishby, N., Pereira, F. C., & Bialek, W. The information bottleneck method. Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing (pp. 368--377).]]
[42]
Weiss, Y. (1999). Segmentation using eigenvectors: A unifying view. Proceedings of International Conference on Computer Vision-Volume 2 ICCV (2) (pp. 975--982).]]
[43]
Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. ACM SIGIR 2003 (pp. 267--273).]]
[44]
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Conference (pp. 103--114).]]
[45]
Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis (Technical Report). Department of Computer Science, University of Minnesota.]]
[46]
Zhao, Y., & Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document datasets (Technical Report). Department of Computer Science, University of Minnesota.]]
[47]
Zhong, S., & Ghosh, J. (2003). A comparative study of generative models for document clustering. Proceedings of the workshop on Clustering High Dimensional Data and Its Applications in SIAM Data Mining Conference.]]
[48]
Zhu, S., Li, T., & Ogihara, M. (2002). CoFD: An algorithm for non-distance based clustering in high dimensional spaces. 4th International Conference on Data Warehousing and Knowledge Discovery (Dawak 2002) (pp. 52--62).]]

Cited By

View all

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN:1581138814
DOI:10.1145/1008992
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 July 2004

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. adaptive subspace identification
  2. alternating optimization
  3. document clustering
  4. factor analysis

Qualifiers

  • Article

Conference

SIGIR04
Sponsor:

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)6
  • Downloads (Last 6 weeks)2
Reflects downloads up to 17 Feb 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Hidden Variable Models in Text Classification and Sentiment AnalysisElectronics10.3390/electronics1310185913:10(1859)Online publication date: 10-May-2024
  • (2024)Learning a Subspace and Clustering Simultaneously with Manifold Regularized Nonnegative Matrix FactorizationGuidance, Navigation and Control10.1142/S273748072450013404:03Online publication date: 26-Jul-2024
  • (2024)Multi-task Subspace ClusteringInformation Sciences10.1016/j.ins.2024.120147(120147)Online publication date: Jan-2024
  • (2023)Transforming Complex Problems into K-means SolutionsIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.3237667(1-20)Online publication date: 2023
  • (2023)One-step unsupervised clustering based on information theoretic metric and adaptive neighbor manifold regularizationEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.105880120(105880)Online publication date: Apr-2023
  • (2022)Sparse Count Data Clustering Using an Exponential Approximation to Generalized Dirichlet Multinomial DistributionsIEEE Transactions on Neural Networks and Learning Systems10.1109/TNNLS.2020.302753933:1(89-102)Online publication date: Jan-2022
  • (2022)Graggle: A Graph-based Approach to Document Clustering2022 IEEE International Conference on Big Data (Big Data)10.1109/BigData55660.2022.10020645(748-755)Online publication date: 17-Dec-2022
  • (2022)An Analysis of Document Summarization for Educational Data Classification Using NLP with Machine Learning TechniquesApplied Computational Technologies10.1007/978-981-19-2719-5_12(127-143)Online publication date: 15-May-2022
  • (2020)Teaching Natural Language Processing through Big Data Text Summarization with Problem-Based LearningData and Information Management10.2478/dim-2020-00034:1(18-43)Online publication date: 24-Mar-2020
  • (2019)Exploiting multi–core and many–core parallelism for subspace clusteringInternational Journal of Applied Mathematics and Computer Science10.2478/amcs-2019-000629:1(81-91)Online publication date: 1-Mar-2019
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media