ABSTRACT
Given a collection of m continuous-valued, one-dimensional empirical probability distributions {P1, ..., Pm}, how can we cluster these distributions efficiently with a nonparametric approach? Such problems arise in many real-world settings where keeping the moments of the distribution is not appropriate, because either some of the moments are not defined or the distributions are heavy-tailed or bi-modal. Examples include mining distributions of inter-arrival times and phone-call lengths. We present an efficient algorithm with a non-parametric model for clustering empirical, one-dimensional, continuous probability distributions. Our algorithm, called ep-means, is based on the Earth Mover's Distance and k-means clustering. We illustrate the utility of ep-means on various data sets and applications. In particular, we demonstrate that ep-means effectively and efficiently clusters probability distributions of mixed and arbitrary shapes, recovering ground-truth clusters exactly in cases where existing methods perform at baseline accuracy. We also demonstrate that ep-means outperforms moment-based classification techniques and discovers useful patterns in a variety of real-world applications.
- D. Applegate, T. Dasu, S. Krishnan, and S. Urbanek. Unsupervised clustering of multidimensional distributions using earth mover distance. In KDD, pages 636--644, 2011. Google ScholarDigital Library
- D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarDigital Library
- R. Giancarlo and F. Utro. Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. Theoretical Comp. Sci., 428(0):58--79, 2012. Google ScholarDigital Library
- P. Indyk and E. Price. K-median clustering, model-based compressive sensing, and sparse recovery for earth mover distance. In STOC, pages 627--636, 2011. Google ScholarDigital Library
- S. Jegelka, A. Gretton, B. Schölkopf, B. K. Sriperumbudur, and U. von Luxburg. Generalized clustering via kernel embeddings. In KI 2009: Advances in AI, pages 144--152, 2009. Google ScholarDigital Library
- S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Math. Stat., 22(1):79--86, 1951.Google ScholarCross Ref
- M. Meila. Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth, editors, Learning Theory and Kernel Machines, pages 173--187. 2003.Google ScholarCross Ref
- M. E. Mugavin. Multidimensional scaling: A brief overview. Nurs. Res., 57:64--8, 2008.Google ScholarCross Ref
- P. Raman, J. M. Phillips, and S. Venkatasubramanian. Spatially-aware comparison and consensus for clusterings. CoRR, abs/1102.0026, 2011.Google Scholar
- Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, pages 59--66, 1998. Google ScholarDigital Library
- O. Schwander and F. Nielsen. Learning mixtures by simplifying kernel density estimators. In F. Nielsen and R. Bhatia, editors, Matrix Information Geometry, pages 403--426. 2013.Google ScholarCross Ref
- S. Shirdhonkar and D. Jacobs. Approximate earth mover's distance in linear time. In CVPR, pages 1--8, 2008.Google Scholar
- N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In ICML, pages 1073--1080, 2009. Google ScholarDigital Library
- V. M. Zolotarev. Lévy metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.Google Scholar
- V. M. Zolotarev. Lévy-Prokhorov metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.Google Scholar
Index Terms
- EP-MEANS: an efficient nonparametric clustering of empirical probability distributions
Recommendations
Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global InformatizationIn this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering
In this article, a new initial centroid selection for a K-means document clustering algorithm, namely, Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC-K-means), to improve the performance of text document ...
RK-Means Clustering: K-Means with Reliability
This paper presents an RK-means clustering algorithm which is developed for reliable data grouping by introducing a new reliability evaluation to the K-means clustering algorithm. The conventional K-means clustering algorithm has two shortfalls: 1) the ...
Comments