research-article

EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

Authors:
Keith Henderson

Lawrence Livermore National Laboratory

Lawrence Livermore National Laboratory
View Profile

,
Brian Gallagher

Lawrence Livermore National Laboratory

Lawrence Livermore National Laboratory
View Profile

,
Tina Eliassi-Rad

Rutgers University

Rutgers University
View Profile

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied ComputingApril 2015Pages 893–900https://doi.org/10.1145/2695664.2695860

Published:13 April 2015Publication History

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing

Pages 893–900

ABSTRACT

Given a collection of m continuous-valued, one-dimensional empirical probability distributions {P₁, ..., P_m}, how can we cluster these distributions efficiently with a nonparametric approach? Such problems arise in many real-world settings where keeping the moments of the distribution is not appropriate, because either some of the moments are not defined or the distributions are heavy-tailed or bi-modal. Examples include mining distributions of inter-arrival times and phone-call lengths. We present an efficient algorithm with a non-parametric model for clustering empirical, one-dimensional, continuous probability distributions. Our algorithm, called ep-means, is based on the Earth Mover's Distance and k-means clustering. We illustrate the utility of ep-means on various data sets and applications. In particular, we demonstrate that ep-means effectively and efficiently clusters probability distributions of mixed and arbitrary shapes, recovering ground-truth clusters exactly in cases where existing methods perform at baseline accuracy. We also demonstrate that ep-means outperforms moment-based classification techniques and discovers useful patterns in a variety of real-world applications.

References

D. Applegate, T. Dasu, S. Krishnan, and S. Urbanek. Unsupervised clustering of multidimensional distributions using earth mover distance. In KDD, pages 636--644, 2011. Google ScholarDigital Library
D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarDigital Library
R. Giancarlo and F. Utro. Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. Theoretical Comp. Sci., 428(0):58--79, 2012. Google ScholarDigital Library
P. Indyk and E. Price. K-median clustering, model-based compressive sensing, and sparse recovery for earth mover distance. In STOC, pages 627--636, 2011. Google ScholarDigital Library
S. Jegelka, A. Gretton, B. Schölkopf, B. K. Sriperumbudur, and U. von Luxburg. Generalized clustering via kernel embeddings. In KI 2009: Advances in AI, pages 144--152, 2009. Google ScholarDigital Library
S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Math. Stat., 22(1):79--86, 1951.Google ScholarCross Ref
M. Meila. Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth, editors, Learning Theory and Kernel Machines, pages 173--187. 2003.Google ScholarCross Ref
M. E. Mugavin. Multidimensional scaling: A brief overview. Nurs. Res., 57:64--8, 2008.Google ScholarCross Ref
P. Raman, J. M. Phillips, and S. Venkatasubramanian. Spatially-aware comparison and consensus for clusterings. CoRR, abs/1102.0026, 2011.Google Scholar
Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, pages 59--66, 1998. Google ScholarDigital Library
O. Schwander and F. Nielsen. Learning mixtures by simplifying kernel density estimators. In F. Nielsen and R. Bhatia, editors, Matrix Information Geometry, pages 403--426. 2013.Google ScholarCross Ref
S. Shirdhonkar and D. Jacobs. Approximate earth mover's distance in linear time. In CVPR, pages 1--8, 2008.Google Scholar
N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In ICML, pages 1073--1080, 2009. Google ScholarDigital Library
V. M. Zolotarev. Lévy metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.Google Scholar
V. M. Zolotarev. Lévy-Prokhorov metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.Google Scholar

Index Terms

EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

Recommendations

Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global Informatization

In this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
Read More
DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering

In this article, a new initial centroid selection for a K-means document clustering algorithm, namely, Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC-K-means), to improve the performance of text document ...
Read More
RK-Means Clustering: K-Means with Reliability

This paper presents an RK-means clustering algorithm which is developed for reliable data grouping by introducing a new reliability evaluation to the K-means clustering algorithm. The conventional K-means clustering algorithm has two shortfalls: 1) the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing
April 2015
2418 pages
ISBN:9781450331968
DOI:10.1145/2695664
Conference Chairs:
Roger L. Wainwright
University of Tulsa
,
Juan Manuel Corchado
University of Salamanca, Spain
,
Program Chairs:
Alessio Bechini
University of Pisa, Italy
,
Jiman Hong
Soongsil University, South Korea
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 April 2015
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
SAC '15 Paper Acceptance Rate291of1,211submissions,24%Overall Acceptance Rate1,650of6,669submissions,25%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 5
  Total Citations
  View Citations
- 166
  Total Downloads
- Downloads (Last 12 months)36
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Hybrid Bisect K-Means Clustering Algorithm

DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering

RK-Means Clustering: K-Means with Reliability

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing

ABSTRACT

References

Cited By

Index Terms

Recommendations

Hybrid Bisect K-Means Clustering Algorithm

DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering

RK-Means Clustering: K-Means with Reliability

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media