skip to main content
10.1145/2695664.2695860acmconferencesArticle/Chapter ViewAbstractPublication PagessacConference Proceedingsconference-collections
research-article

EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

Published:13 April 2015Publication History

ABSTRACT

Given a collection of m continuous-valued, one-dimensional empirical probability distributions {P1, ..., Pm}, how can we cluster these distributions efficiently with a nonparametric approach? Such problems arise in many real-world settings where keeping the moments of the distribution is not appropriate, because either some of the moments are not defined or the distributions are heavy-tailed or bi-modal. Examples include mining distributions of inter-arrival times and phone-call lengths. We present an efficient algorithm with a non-parametric model for clustering empirical, one-dimensional, continuous probability distributions. Our algorithm, called ep-means, is based on the Earth Mover's Distance and k-means clustering. We illustrate the utility of ep-means on various data sets and applications. In particular, we demonstrate that ep-means effectively and efficiently clusters probability distributions of mixed and arbitrary shapes, recovering ground-truth clusters exactly in cases where existing methods perform at baseline accuracy. We also demonstrate that ep-means outperforms moment-based classification techniques and discovers useful patterns in a variety of real-world applications.

References

  1. D. Applegate, T. Dasu, S. Krishnan, and S. Urbanek. Unsupervised clustering of multidimensional distributions using earth mover distance. In KDD, pages 636--644, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. D. Arthur and S. Vassilvitskii. K-means++: The advantages of careful seeding. In SODA, pages 1027--1035, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Giancarlo and F. Utro. Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis. Theoretical Comp. Sci., 428(0):58--79, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. Indyk and E. Price. K-median clustering, model-based compressive sensing, and sparse recovery for earth mover distance. In STOC, pages 627--636, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Jegelka, A. Gretton, B. Schölkopf, B. K. Sriperumbudur, and U. von Luxburg. Generalized clustering via kernel embeddings. In KI 2009: Advances in AI, pages 144--152, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Math. Stat., 22(1):79--86, 1951.Google ScholarGoogle ScholarCross RefCross Ref
  7. M. Meila. Comparing clusterings by the variation of information. In B. Schölkopf and M. K. Warmuth, editors, Learning Theory and Kernel Machines, pages 173--187. 2003.Google ScholarGoogle ScholarCross RefCross Ref
  8. M. E. Mugavin. Multidimensional scaling: A brief overview. Nurs. Res., 57:64--8, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  9. P. Raman, J. M. Phillips, and S. Venkatasubramanian. Spatially-aware comparison and consensus for clusterings. CoRR, abs/1102.0026, 2011.Google ScholarGoogle Scholar
  10. Y. Rubner, C. Tomasi, and L. J. Guibas. A metric for distributions with applications to image databases. In ICCV, pages 59--66, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. O. Schwander and F. Nielsen. Learning mixtures by simplifying kernel density estimators. In F. Nielsen and R. Bhatia, editors, Matrix Information Geometry, pages 403--426. 2013.Google ScholarGoogle ScholarCross RefCross Ref
  12. S. Shirdhonkar and D. Jacobs. Approximate earth mover's distance in linear time. In CVPR, pages 1--8, 2008.Google ScholarGoogle Scholar
  13. N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Is a correction for chance necessary? In ICML, pages 1073--1080, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. V. M. Zolotarev. Lévy metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.Google ScholarGoogle Scholar
  15. V. M. Zolotarev. Lévy-Prokhorov metric. In M. Hazewinkel, editor, Encyclopedia of Mathematics. Springer, 2001.Google ScholarGoogle Scholar

Index Terms

  1. EP-MEANS: an efficient nonparametric clustering of empirical probability distributions

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          SAC '15: Proceedings of the 30th Annual ACM Symposium on Applied Computing
          April 2015
          2418 pages
          ISBN:9781450331968
          DOI:10.1145/2695664

          Copyright © 2015 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 April 2015

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          SAC '15 Paper Acceptance Rate291of1,211submissions,24%Overall Acceptance Rate1,650of6,669submissions,25%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader