DOI: 10.1145/3097983.3097989
Research article

Towards an Optimal Subspace for K-Means

Published: 04 August 2017

Abstract

Is there an optimal dimensionality reduction for k-means, revealing the prominent cluster structure hidden in the data? We propose SUBKMEANS, which extends the classic k-means algorithm. The goal of this algorithm is twofold: find a good k-means-style clustering partition and transform the clusters onto a common subspace that is optimal for the cluster structure. Our solution pursues these two goals simultaneously. The dimensionality of this subspace is found automatically, so the algorithm comes without the burden of additional parameters. At the same time, this subspace helps to mitigate the curse of dimensionality. The SUBKMEANS optimization algorithm is intriguingly simple and efficient. It is easy to implement and can readily be adapted to the situation at hand. Furthermore, it is compatible with many existing extensions and improvements of k-means.
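The paper's own pseudocode is not reproduced on this page; the following is a minimal NumPy sketch of the alternating scheme the abstract describes: a k-means-style assignment in a rotated subspace, centroid updates, and an eigendecomposition step that both rotates the space and picks the subspace dimensionality automatically (here taken as the number of negative eigenvalues). Function and variable names are illustrative, not the authors'.

```python
import numpy as np

def subkmeans_sketch(X, k, n_iter=50, seed=0):
    """Hedged sketch of a SubKmeans-style alternating optimization.

    Alternates (1) assignment of points to the nearest center inside the
    current subspace, (2) centroid updates in full space, and (3) an
    eigendecomposition-based update of the orthogonal transform V, with
    the subspace dimensionality m read off the eigenvalue signs, so no
    extra parameter is needed.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mean = X.mean(axis=0)
    S_D = (X - mean).T @ (X - mean)                # total scatter matrix
    V = np.linalg.qr(rng.normal(size=(d, d)))[0]   # random orthogonal init
    m = max(d // 2, 1)                             # initial subspace dim
    centers = X[rng.choice(n, k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        P = V[:, :m]                               # projection onto subspace
        Xp = X @ P
        # 1) assign each point to the nearest center in the subspace
        dists = ((Xp[:, None, :] - (centers @ P)[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # 2) update full-space centroids (keep old center if cluster empties)
        for i in range(k):
            pts = X[labels == i]
            if len(pts):
                centers[i] = pts.mean(axis=0)
        # 3) update V: eigenvectors of (sum of within-cluster scatters
        #    minus total scatter); negative eigenvalues mark the
        #    cluster-structure subspace, fixing m automatically
        S = sum((X[labels == i] - centers[i]).T @ (X[labels == i] - centers[i])
                for i in range(k))
        evals, evecs = np.linalg.eigh(S - S_D)     # ascending eigenvalues
        V = evecs
        m = max(int((evals < 0).sum()), 1)
    return labels, V, m
```

Because step 3 only needs a symmetric eigendecomposition of a d-by-d matrix, each iteration stays close to the cost of a plain k-means pass, which matches the abstract's claim of simplicity and efficiency.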

Supplementary Material

MP4 File (mautz_optimal_subspace.mp4)




Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. clustering
  2. dimensionality reduction
  3. k-means
  4. subspace

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 paper acceptance rate: 64 of 748 submissions (9%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)


Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 6
Reflects downloads up to 05 Mar 2025


Cited By

  • (2024) BPSO-SLM: a binary particle swarm optimization-based self-labeled method for semi-supervised classification. International Journal of Machine Learning and Cybernetics 15(8), 3255-3277. DOI: 10.1007/s13042-023-02091-2. Online publication date: 19-Feb-2024.
  • (2024) Enhancing cluster analysis via topological manifold learning. Data Mining and Knowledge Discovery 38(3), 840-887. DOI: 10.1007/s10618-023-00980-2. Online publication date: 1-May-2024.
  • (2023) Application of Deep Clustering Algorithms. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 5208-5211. DOI: 10.1145/3583780.3615290. Online publication date: 21-Oct-2023.
  • (2023) Benchmarking Deep Clustering Algorithms With ClustPy. 2023 IEEE International Conference on Data Mining Workshops (ICDMW), 625-632. DOI: 10.1109/ICDMW60847.2023.00087. Online publication date: 4-Dec-2023.
  • (2023) k-SubMix: Common Subspace Clustering on Mixed-Type Data. Machine Learning and Knowledge Discovery in Databases: Research Track, 662-677. DOI: 10.1007/978-3-031-43412-9_39. Online publication date: 17-Sep-2023.
  • (2022) DBHD: Density-based clustering for highly varying density. 2022 IEEE International Conference on Data Mining (ICDM), 921-926. DOI: 10.1109/ICDM54844.2022.00108. Online publication date: Nov-2022.
  • (2022) ISBFK-means. Expert Systems with Applications 201(C). DOI: 10.1016/j.eswa.2022.117018. Online publication date: 1-Sep-2022.
  • (2021) Deep Embedded K-Means Clustering. 2021 International Conference on Data Mining Workshops (ICDMW), 686-694. DOI: 10.1109/ICDMW53433.2021.00090. Online publication date: Dec-2021.
  • (2021) A novel sub-Kmeans based on co-training approach by transforming single-view into multi-view. Future Generation Computer Systems 125(C), 831-843. DOI: 10.1016/j.future.2021.07.019. Online publication date: 30-Dec-2021.
  • (2021) Utilizing Structure-Rich Features to Improve Clustering. Machine Learning and Knowledge Discovery in Databases, 91-107. DOI: 10.1007/978-3-030-67658-2_6. Online publication date: 25-Feb-2021.
