skip to main content
10.5555/1283383.1283496acmconferencesArticle/Chapter ViewAbstractPublication PagessodaConference Proceedingsconference-collections
Article

A rigorous analysis of population stratification with limited data

Published: 07 January 2007 Publication History

Abstract

Finding the genetic factors of complex diseases such as cancer, currently a major effort of the international community, will potentially lead to better treatment of these diseases. One of the major difficulties in these studies, is the fact that the genetic components of an individual not only depend on the disease, but also on its ethnicity. Therefore, it is crucial to find methods that could reduce the population structure effects on these studies. This can be formalized as a clustering problem, where the individuals are clustered according to their genetic information.
Mathematically, we consider the problem of clustering bit "feature" vectors, where each vector represents the genetic information of an individual. Our model assumes that this bit vector is generated according to a prior probability distribution specified by the individual's membership in a population. We present methods that can cluster the vectors while attempting to optimize the number of features required. The focus of the paper is not on the algorithms, but on showing that optimizing certain objective functions on the data yields the right clustering, under the random generative model. In particular, we prove that some of the previous formulations for clustering are effective.
We consider two different clustering approaches. The first approach forms a graph, and then clusters the data using a connected components algorithm, or a max cut algorithm. The second approach tries to estimate simultanously the feature frequencies in each of the populations, and the classification of vectors into populations. We show that using the first approach Θ(logN2) data (i.e., total number of features times number of vectors) is sufficient to find the correct classification, where N is the number of vectors of each population, and γ is the average l22 distance between the feature probability vectors of the two populations. Using the second approach, we show that O(log N4) data is enough, where α is the average l1 distance between the populations.
We also present polynomial time algorithms for the resulting max margin which, for now, needs only slightly more data than stated above. Our methods can also be used to give a simple combinatorial algorithm for finding a bisection in a random graph that matches Boppana's convex programming approach (and McSherry's spectral results).

References

[1]
D. Achlioptas and F. McSherry. On spectral learning of mixtures of distributions. In Proceedings of the 18th Annual Conference on Learning Theory, pages 458--469, 2005.
[2]
N. Ailon, M. Charikar, and A. Newman. Aggregating inconsistent information: ranking and clustering. In Proceedings of 37th ACM Symposium on Theory of Computing, pages 684--693, 2005.
[3]
S. Arora and R. Kannan. Learning mixtures of arbitrary gaussians. In Proceedings of 33rd ACM Symposium on Theory of Computing, pages 247--257, 2001.
[4]
N. Bansal, A. Blum, and S. Chawla. Correlation clustering. In Proceedings of the 43th IEEE Symposium on Foundations of Computer Science, pages 238--247, 2002.
[5]
R. Boppana. Eigenvalues and graph bisection: An average case analysis. In Proceedings of the 28th IEEE Symposium on Foundations of Computer Science, pages 280--285, 1987.
[6]
John Canny. Collaborative filtering with privacy via factor analysis. In Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval, pages 238--245, 2002.
[7]
Ted Carson and Russell Impagliazzo. Hill-climbing finds random planted bisections. In Proceedings of SODA 2001, 2001.
[8]
Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. Clustering with qualitative information. In FOCS, pages 524--533, 2003.
[9]
Anne Condon and Richard M. Karp. Algorithms for graph partitioning on the planted partition model. Random Structures and Algorithms, 18(2):116--140, 2001.
[10]
A. Dasgupta, J. Hopcroft, J. Kleinberg, and M. Sandler. On learning mixtures of heavy-tailed distributions. In Proceedings of the 46th IEEE Symposium on Foundations of Computer Science, pages 491--500, 2005.
[11]
S. Dasgupta. Learning mixtures of gaussians. In Proceedings of the 40th IEEE Symposium on Foundations of Computer Science, pages 634--644, 1999.
[12]
Mark Jerrum and Gregory Sorkin. Simulated annealing for graph bisection. In Proceedings of the 34th IEEE Symposium on Foundations of Computer Science, pages 94--103, 1993.
[13]
J. Kleinberg and M. Sandler. Convergent algorithms for collaborative filtering. In Proc. 4th ACM Conference on Electronic Commerce, 2003.
[14]
J. Kleinberg and M. Sandler. Using mixture models for collaborative filtering. In Proceedings of the 36th ACM Symposium on Theory of computing, pages 569--578, 2004.
[15]
Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, and Andrew Tomkins. Recommendation systems: A probabilistic analysis. In Proceedings of the 39th IEEE Symposium on Foundations of Computer Science, pages 664--673, 1998.
[16]
Frank McSherry. Spectral partitioning of random graphs. In Proceedings of the 42nd IEEE Symposium on Foundations of Computer Science, pages 529--537, 2001.
[17]
A. Panconesi and D. Dubhashi. Concentration of measure for the analysis of randomised algorithms. Draft.
[18]
Alkes L. Price, Nick J. Patterson, Robert M. Plenge, Michael E. Weinblatt, Nancy A. Shadick, and David Reich. Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904--909, July 2006.
[19]
J. K. Pritchard, M. Stephens, and P. Donnelly. Inference of population structure using multilocus genotype data. Genetics, 155:954--959, June 2000.
[20]
Dekel Tsur Ron Shamir, Roded Sharan. Cluster graph modification problems. In Proceedings of the 28th International Workshop on Graph-Theoretic Concepts in Computer Science, pages 379--390, 2002.
[21]
C. Swamy. Correlation clustering: maximizing agreements via semidefinite programming. In Proceedings of SODA, 2004.
[22]
V. Vempala and G. Wang. A spectral algorithm of learning mixtures of distributions. In Proceedings of the 43rd IEEE Symposium on Foundations of Computer Science, pages 113--123, 2002.
[23]
Shuheng Zhou. Routing, Disjoint Paths and Classification. PhD thesis, Carnegie Mellon University, 2006. CMU-PDL-06-109.

Cited By

View all
  • (2015)Learning Arbitrary Statistical Mixtures of Discrete DistributionsProceedings of the forty-seventh annual ACM symposium on Theory of Computing10.1145/2746539.2746584(743-752)Online publication date: 14-Jun-2015
  • (2014)Learning mixtures of arbitrary distributions over large discrete domainsProceedings of the 5th conference on Innovations in theoretical computer science10.1145/2554797.2554818(207-224)Online publication date: 12-Jan-2014
  • (2007)Separating populations with wide dataProceedings of the 18th international conference on Algorithms and computation10.5555/1781574.1781624(439-451)Online publication date: 17-Dec-2007
  1. A rigorous analysis of population stratification with limited data

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SODA '07: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms
    January 2007
    1322 pages
    ISBN:9780898716245
    • Conference Chair:
    • Harold Gabow

    Sponsors

    Publisher

    Society for Industrial and Applied Mathematics

    United States

    Publication History

    Published: 07 January 2007

    Check for updates

    Qualifiers

    • Article

    Acceptance Rates

    SODA '07 Paper Acceptance Rate 139 of 382 submissions, 36%;
    Overall Acceptance Rate 411 of 1,322 submissions, 31%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)2
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 07 Mar 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2015)Learning Arbitrary Statistical Mixtures of Discrete DistributionsProceedings of the forty-seventh annual ACM symposium on Theory of Computing10.1145/2746539.2746584(743-752)Online publication date: 14-Jun-2015
    • (2014)Learning mixtures of arbitrary distributions over large discrete domainsProceedings of the 5th conference on Innovations in theoretical computer science10.1145/2554797.2554818(207-224)Online publication date: 12-Jan-2014
    • (2007)Separating populations with wide dataProceedings of the 18th international conference on Algorithms and computation10.5555/1781574.1781624(439-451)Online publication date: 17-Dec-2007

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Figures

    Tables

    Media

    Share

    Share

    Share this Publication link

    Share on social media