DOI: 10.1145/3097983.3097989
Research article

Towards an Optimal Subspace for K-Means

Published: 04 August 2017

Abstract

Is there an optimal dimensionality reduction for k-means, revealing the prominent cluster structure hidden in the data? We propose SUBKMEANS, which extends the classic k-means algorithm. The goal of this algorithm is twofold: find a good k-means-style clustering partition and transform the clusters onto a common subspace that is optimal for the cluster structure. Our solution pursues these two goals simultaneously. The dimensionality of this subspace is found automatically, so the algorithm comes without the burden of additional parameters. At the same time, this subspace helps to mitigate the curse of dimensionality. The SUBKMEANS optimization algorithm is intriguingly simple and efficient. It is easy to implement and can readily be adapted to the situation at hand. Furthermore, it is compatible with many existing extensions and improvements of k-means.
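The paper's own pseudocode is not reproduced on this page; the following is a minimal NumPy sketch of the alternating scheme the abstract describes: a k-means-style assignment in a rotated subspace, centroid updates, and an eigendecomposition step that both rotates the space and picks the subspace dimensionality automatically (here taken as the number of negative eigenvalues). Function and variable names are illustrative, not the authors'.

```python
import numpy as np

def subkmeans_sketch(X, k, n_iter=50, seed=0):
    """Hedged sketch of a SubKmeans-style alternating optimization.

    Alternates (1) assignment of points to the nearest center inside the
    current subspace, (2) centroid updates in full space, and (3) an
    eigendecomposition-based update of the orthogonal transform V, with
    the subspace dimensionality m read off the eigenvalue signs, so no
    extra parameter is needed.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mean = X.mean(axis=0)
    S_D = (X - mean).T @ (X - mean)                # total scatter matrix
    V = np.linalg.qr(rng.normal(size=(d, d)))[0]   # random orthogonal init
    m = max(d // 2, 1)                             # initial subspace dim
    centers = X[rng.choice(n, k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for _ in range(n_iter):
        P = V[:, :m]                               # projection onto subspace
        Xp = X @ P
        # 1) assign each point to the nearest center in the subspace
        dists = ((Xp[:, None, :] - (centers @ P)[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        # 2) update full-space centroids (keep old center if cluster empties)
        for i in range(k):
            pts = X[labels == i]
            if len(pts):
                centers[i] = pts.mean(axis=0)
        # 3) update V: eigenvectors of (sum of within-cluster scatters
        #    minus total scatter); negative eigenvalues mark the
        #    cluster-structure subspace, fixing m automatically
        S = sum((X[labels == i] - centers[i]).T @ (X[labels == i] - centers[i])
                for i in range(k))
        evals, evecs = np.linalg.eigh(S - S_D)     # ascending eigenvalues
        V = evecs
        m = max(int((evals < 0).sum()), 1)
    return labels, V, m
```

Because step 3 only needs a symmetric eigendecomposition of a d-by-d matrix, each iteration stays close to the cost of a plain k-means pass, which matches the abstract's claim of simplicity and efficiency.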

Supplementary Material

MP4 File (mautz_optimal_subspace.mp4)




Published In

KDD '17: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2017
2240 pages
ISBN:9781450348874
DOI:10.1145/3097983


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. clustering
  2. dimensionality reduction
  3. k-means
  4. subspace

Qualifiers

  • Research-article

Conference

KDD '17

Acceptance Rates

KDD '17 paper acceptance rate: 64 of 748 submissions (9%)
Overall acceptance rate: 1,133 of 8,635 submissions (13%)


Article Metrics

  • Downloads (last 12 months): 40
  • Downloads (last 6 weeks): 6
Reflects downloads up to 05 Mar 2025


Cited By

  • (2024) BPSO-SLM: a binary particle swarm optimization-based self-labeled method for semi-supervised classification. International Journal of Machine Learning and Cybernetics 15(8), 3255-3277. DOI: 10.1007/s13042-023-02091-2. Online publication date: 19-Feb-2024.
  • (2024) Enhancing cluster analysis via topological manifold learning. Data Mining and Knowledge Discovery 38(3), 840-887. DOI: 10.1007/s10618-023-00980-2. Online publication date: 1-May-2024.
  • (2023) Application of Deep Clustering Algorithms. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, 5208-5211. DOI: 10.1145/3583780.3615290. Online publication date: 21-Oct-2023.
  • (2023) Benchmarking Deep Clustering Algorithms With ClustPy. 2023 IEEE International Conference on Data Mining Workshops (ICDMW), 625-632. DOI: 10.1109/ICDMW60847.2023.00087. Online publication date: 4-Dec-2023.
  • (2023) k-SubMix: Common Subspace Clustering on Mixed-Type Data. Machine Learning and Knowledge Discovery in Databases: Research Track, 662-677. DOI: 10.1007/978-3-031-43412-9_39. Online publication date: 17-Sep-2023.
  • (2022) DBHD: Density-based clustering for highly varying density. 2022 IEEE International Conference on Data Mining (ICDM), 921-926. DOI: 10.1109/ICDM54844.2022.00108. Online publication date: Nov-2022.
  • (2022) ISBFK-means. Expert Systems with Applications 201(C). DOI: 10.1016/j.eswa.2022.117018. Online publication date: 1-Sep-2022.
  • (2021) Deep Embedded K-Means Clustering. 2021 International Conference on Data Mining Workshops (ICDMW), 686-694. DOI: 10.1109/ICDMW53433.2021.00090. Online publication date: Dec-2021.
  • (2021) A novel sub-Kmeans based on co-training approach by transforming single-view into multi-view. Future Generation Computer Systems 125(C), 831-843. DOI: 10.1016/j.future.2021.07.019. Online publication date: 30-Dec-2021.
  • (2021) Utilizing Structure-Rich Features to Improve Clustering. Machine Learning and Knowledge Discovery in Databases, 91-107. DOI: 10.1007/978-3-030-67658-2_6. Online publication date: 25-Feb-2021.
