DOI: 10.1145/1374376.1374474

A discriminative framework for clustering via similarity functions

Published: 17 May 2008

Abstract

Problems of clustering data from pairwise similarity information are ubiquitous in Computer Science. Theoretical treatments typically view the similarity information as ground-truth and then design algorithms to (approximately) optimize various graph-based objective functions. However, in most applications, this similarity information is merely based on some heuristic; the ground truth is really the unknown correct clustering of the data points and the real goal is to achieve low error on the data. In this work, we develop a theoretical approach to clustering from this perspective. In particular, motivated by recent work in learning theory that asks "what natural properties of a similarity (or kernel) function are sufficient to be able to learn well?" we ask "what natural properties of a similarity function are sufficient to be able to cluster well?"
To study this question we develop a theoretical framework that can be viewed as an analog of the PAC learning model for clustering, where the object of study, rather than being a concept class, is a class of (concept, similarity function) pairs, or equivalently, a property the similarity function should satisfy with respect to the ground-truth clustering. We then analyze both algorithmic and information-theoretic issues in our model. While quite strong properties are needed if the goal is to produce a single approximately-correct clustering, we find that a number of reasonable properties are sufficient under two natural relaxations: (a) list clustering: analogous to the notion of list-decoding, the algorithm can produce a small list of clusterings (from which a user can select) and (b) hierarchical clustering: the algorithm's goal is to produce a hierarchy such that the desired clustering is some pruning of this tree (which a user could navigate). We develop a notion of the clustering complexity of a given property (analogous to notions of capacity in learning theory) that characterizes its information-theoretic usefulness for clustering. We analyze this quantity for several natural game-theoretic and learning-theoretic properties, and design new efficient algorithms that are able to take advantage of them. Our algorithms for hierarchical clustering combine recent learning-theoretic approaches with linkage-style methods. We also show how our algorithms can be extended to the inductive case, i.e., using just a constant-sized sample, as in property testing. The analysis here uses regularity-type results of Frieze and Kannan [20] and Alon et al. [3].
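To make the hierarchical (tree-plus-pruning) relaxation concrete, below is a minimal sketch of the linkage-style ingredient in Python. It builds a merge tree by single linkage over a pairwise similarity matrix; under a strict-separation-style condition (every point is more similar to all points in its own target cluster than to any point outside it), each target cluster ends up as a subtree, so the desired clustering survives as a pruning of the tree. All names below are ours for illustration, and the sketch omits the learning-theoretic steps that the paper's actual hierarchical algorithms combine with linkage methods.

```python
# Minimal single-linkage sketch: builds a merge tree from a pairwise
# similarity matrix S (n x n, symmetric, higher = more similar).
# Names (single_linkage_tree, merges, ...) are illustrative, not from the paper.

def single_linkage_tree(S):
    n = len(S)
    clusters = {i: [i] for i in range(n)}   # current clusters, keyed by id
    merges = []                             # (left_id, right_id, new_id) history
    next_id = n
    while len(clusters) > 1:
        # Find the pair of current clusters with the largest single-link
        # (i.e., maximum) cross-similarity.
        best, best_pair = float("-inf"), None
        ids = list(clusters)
        for i_pos in range(len(ids)):
            for j_pos in range(i_pos + 1, len(ids)):
                a, b = ids[i_pos], ids[j_pos]
                link = max(S[x][y] for x in clusters[a] for y in clusters[b])
                if link > best:
                    best, best_pair = link, (a, b)
        a, b = best_pair
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, next_id))
        next_id += 1
    return merges  # a binary tree over points 0..n-1; prunings are candidate clusterings

if __name__ == "__main__":
    # Toy similarity matrix with two groups {0,1} and {2,3}: all within-group
    # similarities exceed all cross-group ones (strict separation), so each
    # group appears as a subtree and the pair forms a pruning of the tree.
    S = [[1.0, 0.9, 0.2, 0.1],
         [0.9, 1.0, 0.3, 0.2],
         [0.2, 0.3, 1.0, 0.8],
         [0.1, 0.2, 0.8, 1.0]]
    print(single_linkage_tree(S))  # [(0, 1, 4), (2, 3, 5), (4, 5, 6)]
```

Cutting the toy example's tree just below the final merge recovers the two groups {0,1} and {2,3}; in the list-clustering relaxation one would instead enumerate a small set of prunings for a user to select from.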

References

[1] D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In COLT, 2005.
[2] N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. In STOC, pages 684--693, 2005.
[3] N. Alon, W. Fernandez de la Vega, R. Kannan, and M. Karpinski. Random sampling and approximation of MAX-CSPs. Journal of Computer and System Sciences, 67(2):212--243, 2003.
[4] N. Alon and N. Kahale. A spectral technique for coloring random 3-colorable graphs. SIAM J. Computing, 26(6):1733--1748, 1997.
[5] S. Arora and R. Kannan. Learning mixtures of arbitrary Gaussians. In ACM Symposium on Theory of Computing, 2005.
[6] M.-F. Balcan and A. Blum. On a theory of learning with similarity functions. In International Conference on Machine Learning, 2006.
[7] M.-F. Balcan, A. Blum, and S. Vempala. On kernels, margins and low-dimensional mappings. Machine Learning Journal, 2006.
[8] M.-F. Balcan, A. Blum, and S. Vempala. A theory of similarity functions for clustering. Technical Report CMU-CS-07-142, 2007.
[9] S. Ben-David. A framework for statistical clustering with constant time approximation for k-means clustering. Machine Learning Journal, 66(2-3):243--257, 2007.
[10] N. Bansal, A. Blum, and S. Chawla. Correlation clustering. Machine Learning, 56:89--113, 2004.
[11] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of recent advances. ESAIM: Probability and Statistics, 9:323--375, 2005.
[12] M. Charikar, S. Guha, E. Tardos, and D. B. Shmoys. A constant-factor approximation algorithm for the k-median problem. In STOC, 1999.
[13] M. Charikar, V. Guruswami, and A. Wirth. Clustering with qualitative information. In FOCS, pages 524--533, 2003.
[14] D. Cheng, R. Kannan, S. Vempala, and G. Wang. A divide-and-merge methodology for clustering. ACM Trans. Database Syst., 31(4):1499--1525, 2006.
[15] A. Dasgupta, J. Hopcroft, J. Kleinberg, and M. Sandler. On learning mixtures of heavy-tailed distributions. In 46th IEEE Symposium on Foundations of Computer Science, 2005.
[16] A. Dasgupta, J. E. Hopcroft, R. Kannan, and P. P. Mitra. Spectral clustering by recursive partitioning. In ESA, pages 256--267, 2006.
[17] S. Dasgupta. Learning mixtures of Gaussians. In 40th Annual IEEE Symposium on Foundations of Computer Science, 1999.
[18] W. Fernandez de la Vega, M. Karpinski, C. Kenyon, and Y. Rabani. Approximation schemes for clustering problems. In STOC, 2003.
[19] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
[20] A. Frieze and R. Kannan. Quick approximation to matrices and applications. Combinatorica, 19(2):175--220, 1999.
[21] R. Herbrich. Learning Kernel Classifiers. MIT Press, Cambridge, 2002.
[22] K. Jain and V. V. Vazirani. Approximation algorithms for metric facility location and k-median problems using the primal-dual schema and Lagrangian relaxation. JACM, 48(2):274--296, 2001.
[23] R. Kannan, H. Salmasian, and S. Vempala. The spectral method for general mixture models. In Proc. COLT, 2005.
[24] R. Kannan, S. Vempala, and A. Vetta. On clusterings: good, bad and spectral. J. ACM, 51(3):497--515, 2004.
[25] J. Kleinberg. An impossibility theorem for clustering. In NIPS, 2002.
[26] D. E. Knuth. The Art of Computer Programming. Addison-Wesley, 1997.
[27] R. I. Kondor and J. Lafferty. Diffusion kernels on graphs and other discrete structures. In Proc. ICML, 2002.
[28] F. McSherry. Spectral partitioning of random graphs. In Proc. 42nd Symp. Foundations of Computer Science, pages 529--537, 2001.
[29] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.
[30] J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926--1940, 1998.
[31] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
[32] N. Srebro. How good is a kernel as a similarity function? In Proc. 20th Annual Conference on Learning Theory, 2007.
[33] C. Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In SODA, 2004.
[34] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134--1142, 1984.
[35] V. N. Vapnik. Statistical Learning Theory. Wiley and Sons, 1998.
[36] S. Vempala and G. Wang. A spectral algorithm for learning mixture models. J. Comp. Sys. Sci., 68(2):841--860, 2004.

    Published In

    STOC '08: Proceedings of the fortieth annual ACM symposium on Theory of computing
    May 2008
    712 pages
    ISBN:9781605580470
    DOI:10.1145/1374376

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. clustering
    2. learning
    3. similarity functions

    Qualifiers

    • Research-article

    Conference

STOC '08: Symposium on Theory of Computing
May 17-20, 2008
Victoria, British Columbia, Canada

    Acceptance Rates

STOC '08: 80 of 325 submissions accepted (25%)
Overall: 1,469 of 4,586 submissions accepted (32%)


    Cited By

• (2024) Cost-Effective Deployment for Fully-Decoupled RAN: A Techno-Economic Approach. IEEE Transactions on Vehicular Technology, 73(11):17007-17023. DOI: 10.1109/TVT.2024.3424547. Online publication date: Nov-2024.
• (2023) Private distribution learning with public data. Proceedings of the 37th International Conference on Neural Information Processing Systems, 7184-7215. DOI: 10.5555/3666122.3666437. Online publication date: 10-Dec-2023.
• (2023) "Intelligent Heuristics Are the Future of Computing". ACM Transactions on Intelligent Systems and Technology, 14(6):1-39. DOI: 10.1145/3627708. Online publication date: 14-Nov-2023.
• (2023) Online Level-wise Hierarchical Clustering. Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1733-1745. DOI: 10.1145/3580305.3599455. Online publication date: 6-Aug-2023.
• (2023) A Characterization of List Learnability. Proceedings of the 55th Annual ACM Symposium on Theory of Computing, 1713-1726. DOI: 10.1145/3564246.3585190. Online publication date: 2-Jun-2023.
• (2023) Multi-channel Episodic Memory Building using Recurrent Kernel Machine. 2023 25th International Multitopic Conference (INMIC), 1-6. DOI: 10.1109/INMIC60434.2023.10466071. Online publication date: 17-Nov-2023.
• (2023) Clustering-based gradual pattern mining. International Journal of Machine Learning and Cybernetics, 15(6):2263-2281. DOI: 10.1007/s13042-023-02027-w. Online publication date: 30-Nov-2023.
• (2023) Group and Individual Fairness in Clustering Algorithms. Ethics in Artificial Intelligence: Bias, Fairness and Beyond, 31-51. DOI: 10.1007/978-981-99-7184-8_2. Online publication date: 30-Dec-2023.
• (2022) List-decodable sparse mean estimation via difference-of-pairs filtering. Proceedings of the 36th International Conference on Neural Information Processing Systems, 13947-13960. DOI: 10.5555/3600270.3601284. Online publication date: 28-Nov-2022.
• (2022) Clustering mixture models in almost-linear time via list-decodable mean estimation. Proceedings of the 54th Annual ACM SIGACT Symposium on Theory of Computing, 1262-1275. DOI: 10.1145/3519935.3520014. Online publication date: 9-Jun-2022.
