Article

Document clustering via adaptive subspace iteration

Authors:

Mitsunori OgiharaAuthors Info & Claims

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 218 - 225

https://doi.org/10.1145/1008992.1009031

Published: 25 July 2004 Publication History

Abstract

Document clustering has long been an important problem in information retrieval. In this paper, we present a new clustering algorithm ASI¹, which uses explicitly modeling of the subspace structure associated with each cluster. ASI simultaneously performs data reduction and subspace identification via an iterative alternating optimization procedure. Motivated from the optimization procedure, we then provide a novel method to determine the number of clusters. We also discuss the connections of ASI with various existential clustering approaches. Finally, extensive experimental results on real data sets show the effectiveness of ASI algorithm.

References

[1]

Aggarwal, C. C., Wolf, J. L., Yu, P. S., Procopiuc, C., & Park, J. S. (1999). Fast algorithms for projected clustering. ACM SIGMOD Conference (pp. 61--72).]]

Digital Library

[2]

Agrawal, R., Gehrke, J., Gunopulos, D., & Raghavan, P. (1998). Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Conference (pp. 94--105).]]

Digital Library

[3]

Anderberg, M. R. (1973). Cluster analysis for applications. Academic Press Inc.]]

[4]

Berger, M., & Rigoutsos, I. (1991). An algorithm for point clustering and grid generation. IEEE Trans. on Systems, Man and Cybernetics, 21, 1278--1286.]]

[5]

Beyer, K., Goldstein, J., Ramakrishnan, R., & Shaft, U. (1999). When is nearest neighbor meaningful? Proceedings of 7th International Conference on Database Theory(ICDT'99) (pp. 217--235).]]

Digital Library

[6]

Bock, H.-H. (1989). Probabilistic aspects in cluster analysis. In O. Opitz (Ed.), Conceptual and numerical analysis of data, 12--44. Berlin: Springer-verlag.]]

[7]

Boley, D., Gini, M., Gross, R., Han, E.-H., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1999). Document categorization and query generation on the world wide web using webace. AI Review, 13, 365--391.]]

Digital Library

[8]

Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., & Freeman, D. (1988). Autoclass: a Bayesian classification system. Proceedings of the Fifteenth International Conference on Machine Learning(ICML'88).]]

[9]

Cheng, Y., & Church, G. M. Biclustering of expression data. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology (ISMB) (pp. 93--103).]]

Digital Library

[10]

Cover, T. M., & Thomas, J. A. (1991). Elements of information theory. John Wiley and Sons.]]

Digital Library

[11]

Deuflhard, P., Huisinga, W., Fischer, A., & Schutte, C. (2000). Identification of almost invariant aggregates in reversible nearly coupled markov chain. Linear Algebra and Its Applications, 315,39--59.]]

[12]

Dhillon, I. (2001). Co-clustering documents and words using bipartite spectral graph partitioning (Technical Report 2001-05). Department of Computer Science, University of Texas at Austin.]]

Digital Library

[13]

Dhillon, I. S., Mallela, S., & Modha, S. S. (2003). Information-theoretic co-clustering. ACM SIGKDD Conference (pp. 89--98).]]

Digital Library

[14]

Ding, C., He, X., Zha, H., & Simon, H. (2002). Adaptive dimension reduction for clustering high dimensional data. IEEE International Conference on Data Mining(ICDM 2002) (pp. 107--114).]]

Digital Library

[15]

Domeniconi, C., Peng, J., & Gunopulos, D. (2002). Locally adaptive metric nearest-neighbor classification. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 1281--1285.]]

Digital Library

[16]

Globerson, A., & Tishby, N. (2003). Sufficient dimensionality reduction. J.Mach.Learn.Res., 3, 1307--1331.]]

Digital Library

[17]

Golub, G. H., & Loan, C. F. V. (1991). Matrix computations. The Johns Hopkins University Press.]]

[18]

Govaert, G. (1985). Simultaneous clustering of rows and columns. Control and Cybernetics, 437--458.]]

[19]

Guha, S., Rastogi, R., & Shim, K. (1998). CURE: an efficient clustering algorithm for large databases. ACM SIGMOD Conference (pp. 73--84).]]

Digital Library

[20]

Hagen, L., & Kahng, A. B. (1992). New spectral methods for ratio cut partitioning and clustering. IEEE Trans. Computer-Aided Design, 11, 1074--1085.]]

Digital Library

[21]

Han, E.-H., Boley, D., Gini, M., Gross, R., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., & Moore, J. (1998). WebACE: A web agent for document categorization and exploration. Proceedings of the 2nd International Conference on Autonomous Agents (Agents'98) (pp. 408--415).]]

Digital Library

[22]

Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. Morgan Kaufmann Publishers.]]

Digital Library

[23]

Hartigan, J. (1975). Clustering algorithms. Wiley.]]

Digital Library

[24]

Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: Data mining, inference, prediction. Springer.]]

[25]

Jain, A. K., & Dubes, R. C. (1988). Algorithms for clustering data. Prentice Hall.]]

Digital Library

[26]

Johnson, R., & Wichern, D. (1998). Applied multivariate statistical analysis. New York: Prentice-Hall.]]

Digital Library

[27]

Kato, L. (1995). Perturbation theory for linear operators. Springer.]]

[28]

Lee, D. D., & Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401, 788--791.]]

[29]

Leung, Y., she Zhang, J., & Xu, Z.-B. (2000). Clustering by scale-space filtering. IEEE Transactions on pattern analysis and machine intelligence, 22, 1396--1410.]]

Digital Library

[30]

Li, T., & Ma, S. (2004). IFD:iterative feature and data clustering. Proceedings of the 2004 SIAM International conference on Data Mining (SDM 2004).]]

[31]

Li, T., Zhu, S., & Ogihara, M. (2003). Efficient multi-way text categorization via generalized discriminant analysis. Proceedings of the Twelfth Conference on Information and Knowledge Management(CIKM 2003) (pp. 317--324).]]

Digital Library

[32]

Linde, Y., Buzo, A., & Gray, R. (1980). An algorithm for vector quantization design. IEEE Transactions on Communications, 28,84--95.]]

[33]

McCallum, A. K. (1996). Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/ mccallum/bow.]]

[34]

Ng, A., Jordan, M., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. Advances in Neural Information Processing Systems 14 (NIPS'01).]]

[35]

Nishisato, S. (1980). Analysis of categorical data: Dual scaling and its applications. Toronto: University of Toronto Press.]]

[36]

Perona, P., & Freeman, W. (1998). A factorization approach to grouping. Lecture Notes in Computer Science, 1406, 655--670.]]

Digital Library

[37]

Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 888--905.]]

Digital Library

[38]

Slonim, N., & Tishby, N. (1999). Agglomerative information bottleneck. Advances in Neural Information Processing Systems 12 (NIPS'99).]]

[39]

Slonim, N., & Tishby, N. (2000). Document clustering using word clusters via the information bottleneck method. ACM SIGIR 2000 (pp. 208--215).]]

Digital Library

[40]

Spielman, D. A., & Teng, S.-H. (1996). Spectral partitioning works: Planar graphs and finite element meshes. In IEEE Symposium on Foundations of Computer Science (pp. 96--105).]]

Digital Library

[41]

Tishby, N., Pereira, F. C., & Bialek, W. The information bottleneck method. Proc. of the 37-th Annual Allerton Conference on Communication, Control and Computing (pp. 368--377).]]

[42]

Weiss, Y. (1999). Segmentation using eigenvectors: A unifying view. Proceedings of International Conference on Computer Vision-Volume 2 ICCV (2) (pp. 975--982).]]

Digital Library

[43]

Xu, W., Liu, X., & Gong, Y. (2003). Document clustering based on non-negative matrix factorization. ACM SIGIR 2003 (pp. 267--273).]]

Digital Library

[44]

Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: an efficient data clustering method for very large databases. ACM SIGMOD Conference (pp. 103--114).]]

Digital Library

[45]

Zhao, Y., & Karypis, G. (2001). Criterion functions for document clustering: Experiments and analysis (Technical Report). Department of Computer Science, University of Minnesota.]]

Digital Library

[46]

Zhao, Y., & Karypis, G. (2002). Evaluation of hierarchical clustering algorithms for document datasets (Technical Report). Department of Computer Science, University of Minnesota.]]

[47]

Zhong, S., & Ghosh, J. (2003). A comparative study of generative models for document clustering. Proceedings of the workshop on Clustering High Dimensional Data and Its Applications in SIAM Data Mining Conference.]]

[48]

Zhu, S., Li, T., & Ogihara, M. (2002). CoFD: An algorithm for non-distance based clustering in high dimensional spaces. 4th International Conference on Data Warehousing and Knowledge Discovery (Dawak 2002) (pp. 52--62).]]

Digital Library

Cited By

Koochemeshkian PIhou Koffi EBouguila N(2024)Hidden Variable Models in Text Classification and Sentiment AnalysisElectronics10.3390/electronics1310185913:10(1859)Online publication date: 10-May-2024
https://doi.org/10.3390/electronics13101859
Nie FChen HHuang HDing CLi X(2024)Learning a Subspace and Clustering Simultaneously with Manifold Regularized Nonnegative Matrix FactorizationGuidance, Navigation and Control10.1142/S273748072450013404:03Online publication date: 26-Jul-2024
https://doi.org/10.1142/S2737480724500134
Zhong GPun C(2024)Multi-task Subspace ClusteringInformation Sciences10.1016/j.ins.2024.120147(120147)Online publication date: Jan-2024
https://doi.org/10.1016/j.ins.2024.120147
Show More Cited By

Index Terms

Document clustering via adaptive subspace iteration
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Tensor clustering via adaptive subspace iteration

Multi-way data or tensors are generalizations of matrices. Clustering multi-way data is a very important research topic due to the intrinsic rich structures in real-world datasets. Despite significant progress made on subspace clustering for two-way ...
Clustering multi-way data via adaptive subspace iteration
CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management

Clustering multi-way data is a very important research topic due to the intrinsic rich structures in real-world datasets. In this paper, we propose the subspace clustering algorithm on multi-way data, called ASI-T (Adaptive Subspace Iteration on Tensor)...
Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global Informatization

In this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

July 2004

624 pages

ISBN:1581138814

DOI:10.1145/1008992

General Chair:
Mark Sanderson
University of Sheffield (UK)
,
Program Chairs:
Kalervo Järvelin
University of Tampere (Finland)
,
James Allan
University of Massachusetts (USA)
,
Peter Bruza
Distributed Systems Technology Centre (Australia)

Copyright © 2004 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 25 July 2004

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Article

Conference

SIGIR04

Sponsor:

SIGIR04: The 27th ACM/SIGIR International Symposium on Information Retrieval 2004

July 25 - 29, 2004

Sheffield, United Kingdom

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

107
Total Citations
View Citations
1,978
Total Downloads

Downloads (Last 12 months)6
Downloads (Last 6 weeks)2

Reflects downloads up to 17 Feb 2025

Other Metrics

View Author Metrics

Citations

Cited By

Koochemeshkian PIhou Koffi EBouguila N(2024)Hidden Variable Models in Text Classification and Sentiment AnalysisElectronics10.3390/electronics1310185913:10(1859)Online publication date: 10-May-2024
https://doi.org/10.3390/electronics13101859
Nie FChen HHuang HDing CLi X(2024)Learning a Subspace and Clustering Simultaneously with Manifold Regularized Nonnegative Matrix FactorizationGuidance, Navigation and Control10.1142/S273748072450013404:03Online publication date: 26-Jul-2024
https://doi.org/10.1142/S2737480724500134
Zhong GPun C(2024)Multi-task Subspace ClusteringInformation Sciences10.1016/j.ins.2024.120147(120147)Online publication date: Jan-2024
https://doi.org/10.1016/j.ins.2024.120147
Liu HChen JDy JFu Y(2023)Transforming Complex Problems into K-means SolutionsIEEE Transactions on Pattern Analysis and Machine Intelligence10.1109/TPAMI.2023.3237667(1-20)Online publication date: 2023
https://doi.org/10.1109/TPAMI.2023.3237667
Li XFan HLiu J(2023)One-step unsupervised clustering based on information theoretic metric and adaptive neighbor manifold regularizationEngineering Applications of Artificial Intelligence10.1016/j.engappai.2023.105880120(105880)Online publication date: Apr-2023
https://doi.org/10.1016/j.engappai.2023.105880
Zamzami NBouguila N(2022)Sparse Count Data Clustering Using an Exponential Approximation to Generalized Dirichlet Multinomial DistributionsIEEE Transactions on Neural Networks and Learning Systems10.1109/TNNLS.2020.302753933:1(89-102)Online publication date: Jan-2022
https://doi.org/10.1109/TNNLS.2020.3027539
King IHuang H(2022)Graggle: A Graph-based Approach to Document Clustering2022 IEEE International Conference on Big Data (Big Data)10.1109/BigData55660.2022.10020645(748-755)Online publication date: 17-Dec-2022
https://doi.org/10.1109/BigData55660.2022.10020645
Pandiaraja PBoopesh KDeepthi TLaksmi Priya MNoodhana R(2022)An Analysis of Document Summarization for Educational Data Classification Using NLP with Machine Learning TechniquesApplied Computational Technologies10.1007/978-981-19-2719-5_12(127-143)Online publication date: 15-May-2022
https://doi.org/10.1007/978-981-19-2719-5_12
Li LGeissinger JIngram WFox E(2020)Teaching Natural Language Processing through Big Data Text Summarization with Problem-Based LearningData and Information Management10.2478/dim-2020-00034:1(18-43)Online publication date: 24-Mar-2020
https://doi.org/10.2478/dim-2020-0003
Datta AKaur ALauer TChabbouh S(2019)Exploiting multi–core and many–core parallelism for subspace clusteringInternational Journal of Applied Mathematics and Computer Science10.2478/amcs-2019-000629:1(81-91)Online publication date: 1-Mar-2019
https://dl.acm.org/doi/10.2478/amcs-2019-0006
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten