skip to main content
10.1145/1458082.1458226acmconferencesArticle/Chapter ViewAbstractPublication PagescikmConference Proceedingsconference-collections
research-article

Data weaving: scaling up the state-of-the-art in data clustering

Published:26 October 2008Publication History

ABSTRACT

The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information-theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often outperformed by the others, of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving - a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal - it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.

References

  1. R. Bekkerman, R. El-Yaniv, and A. McCallum. Multi-way distributional clustering via pairwise interactions. In Proceedings of ICML-22, pages 41--48, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. On feature distributional clustering for text categorization. In Proceedings of SIGIR, pages 146--153, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Bekkerman, M. Sahami, and E. Learned-Miller. Combinatorial Markov Random Fields. In Proceedings of ECML-17, 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3), 1986.Google ScholarGoogle Scholar
  5. D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant. Robust information-theoretic clustering. In Proceedings of ACM SIGKDD, pages 65--75, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. S. Brecheisen, H.-P. Kriegel, and M. Pfeifle. Parallel density-based clustering of complex objects. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2006.Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems (NIPS), 2006.Google ScholarGoogle Scholar
  9. K. Crammer, P. Talukdar, and F. Pereira. A rate-distortion one-class model and its applications to clustering. In Proceedings of the 25st International Conference on Machine Learning, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Symposium on Operating System Design and Implementation (OSDI), pages 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. Deb and R. A. Angryk. Distributed document clustering using word-clusters. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 376--383, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  12. I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of SIGKDD-9, pages 89--98, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, volume 1759 of Lecture Notes in Artificial Intelligence, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS-14), 2001.Google ScholarGoogle Scholar
  15. G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Exploration Newsletter, 2(2):34--38, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In Proceedings of UAI-17, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proceedings of ACM SIGKDD, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. P. E. Hadjidoukas and L. Amsaleg. Parallelization of a hierarchical data clustering algorithm using openmp. In Proceedings of the International Workshop on OpenMP (IWOMP), Reims, France, June 2006.Google ScholarGoogle Scholar
  19. M. Johnson, R. H. Liao, A. Rasmussen, R. Sridharan, D. Garcia, and B. Harvey. Infusing parallelism into introductory computer science using mapreduce. In Proceedings of SIGCSE: Symposium on Computer Science Education, 2008.Google ScholarGoogle Scholar
  20. D. Judd, P. K. McKinley, and A. K. Jain. Large-scale parallel data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):871--876, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.Google ScholarGoogle Scholar
  22. D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. JMLR, 5:361--397, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. K. Liu, H. Kargupta, J. Ryan, and K. Bhaduri. Distributed data mining bibliography. http://www.cs.umbc.edu/~hillol/DDMBIB, 2004.Google ScholarGoogle Scholar
  24. B. Long, Z. Zhang, and P. S. Yu. A probabilistic framework for relational clustering. In Proceedings of the ACM SIGKDD, pages 470--479, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in social networks. In Proceedings of IJCAI-19, pages 786--791, 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of ACM SIGKDD, pages 169--178, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. R. Pensa and J.-F. Boulicaut. Constrained co-clustering of gene expression data. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 25--36, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  28. R. Rocci and M. Vichi. Two-mode multi-partitioning. Computational Statistics and Data Analysis, 52(4), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. N. Rooney, D. Patterson, M. Galushka, and V. Dobrynin. A scaleable document clustering approach for large document corpora. Information Processing and Management, 42(5):1163--1175, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of SIGIR-25, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. N. Slonim and N. Tishby. Agglomerative information bottleneck. In Advances in Neural Information Processing Systems 12 (NIPS), pages 617--623, 2000.Google ScholarGoogle Scholar
  32. M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI - The Complete Reference: Volume 1, The MPI Core. MIT Press, 2nd edition, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. C. Sutton and A. McCallum. Piecewise training of undirected models. In Proceedings of UAI-21, 2005.Google ScholarGoogle Scholar
  34. N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method, 1999. Invited paper to the 37th Annual Allerton Conference on Communication, Control, and Computing.Google ScholarGoogle Scholar
  35. X. Xu, J. Jäger, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 3(3):263--290, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Data weaving: scaling up the state-of-the-art in data clustering

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          CIKM '08: Proceedings of the 17th ACM conference on Information and knowledge management
          October 2008
          1562 pages
          ISBN:9781595939913
          DOI:10.1145/1458082

          Copyright © 2008 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 26 October 2008

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          Overall Acceptance Rate1,861of8,427submissions,22%

          Upcoming Conference

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader