ABSTRACT
The enormous amount and dimensionality of data processed by modern data mining tools require effective, scalable unsupervised learning techniques. Unfortunately, the majority of previously proposed clustering algorithms are either effective or scalable. This paper is concerned with information-theoretic clustering (ITC) that has historically been considered the state-of-the-art in clustering multi-dimensional data. Most existing ITC methods are computationally expensive and not easily scalable. Those few ITC methods that scale well (using, e.g., parallelization) are often outperformed by the others, of an inherently sequential nature. First, we justify this observation theoretically. We then propose data weaving - a novel method for parallelizing sequential clustering algorithms. Data weaving is intrinsically multi-modal - it allows simultaneous clustering of a few types of data (modalities). Finally, we use data weaving to parallelize multi-modal ITC, which results in proposing a powerful DataLoom algorithm. In our experimentation with small datasets, DataLoom shows practically identical performance compared to expensive sequential alternatives. On large datasets, however, DataLoom demonstrates significant gains over other parallel clustering methods. To illustrate the scalability, we simultaneously clustered rows and columns of a contingency table with over 120 billion entries.
- R. Bekkerman, R. El-Yaniv, and A. McCallum. Multi-way distributional clustering via pairwise interactions. In Proceedings of ICML-22, pages 41--48, 2005. Google ScholarDigital Library
- R. Bekkerman, R. El-Yaniv, N. Tishby, and Y. Winter. On feature distributional clustering for text categorization. In Proceedings of SIGIR, pages 146--153, 2001. Google ScholarDigital Library
- R. Bekkerman, M. Sahami, and E. Learned-Miller. Combinatorial Markov Random Fields. In Proceedings of ECML-17, 2006.Google ScholarDigital Library
- J. Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, 48(3), 1986.Google Scholar
- D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993--1022, 2003. Google ScholarDigital Library
- C. Böhm, C. Faloutsos, J.-Y. Pan, and C. Plant. Robust information-theoretic clustering. In Proceedings of ACM SIGKDD, pages 65--75, 2006. Google ScholarDigital Library
- S. Brecheisen, H.-P. Kriegel, and M. Pfeifle. Parallel density-based clustering of complex objects. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2006.Google ScholarDigital Library
- C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. R. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Advances in Neural Information Processing Systems (NIPS), 2006.Google Scholar
- K. Crammer, P. Talukdar, and F. Pereira. A rate-distortion one-class model and its applications to clustering. In Proceedings of the 25st International Conference on Machine Learning, 2008. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Symposium on Operating System Design and Implementation (OSDI), pages 137--150, 2004. Google ScholarDigital Library
- D. Deb and R. A. Angryk. Distributed document clustering using word-clusters. In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pages 376--383, 2007.Google ScholarCross Ref
- I. S. Dhillon, S. Mallela, and D. S. Modha. Information-theoretic co-clustering. In Proceedings of SIGKDD-9, pages 89--98, 2003. Google ScholarDigital Library
- I. S. Dhillon and D. S. Modha. A data clustering algorithm on distributed memory multiprocessors. In Large-Scale Parallel Data Mining, volume 1759 of Lecture Notes in Artificial Intelligence, 2000. Google ScholarDigital Library
- R. El-Yaniv and O. Souroujon. Iterative double clustering for unsupervised and semi-supervised learning. In Advances in Neural Information Processing Systems (NIPS-14), 2001.Google Scholar
- G. Forman and B. Zhang. Distributed data clustering can be efficient and exact. SIGKDD Exploration Newsletter, 2(2):34--38, 2000. Google ScholarDigital Library
- N. Friedman, O. Mosenzon, N. Slonim, and N. Tishby. Multivariate information bottleneck. In Proceedings of UAI-17, 2001. Google ScholarDigital Library
- B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In Proceedings of ACM SIGKDD, 2005. Google ScholarDigital Library
- P. E. Hadjidoukas and L. Amsaleg. Parallelization of a hierarchical data clustering algorithm using openmp. In Proceedings of the International Workshop on OpenMP (IWOMP), Reims, France, June 2006.Google Scholar
- M. Johnson, R. H. Liao, A. Rasmussen, R. Sridharan, D. Garcia, and B. Harvey. Infusing parallelism into introductory computer science using mapreduce. In Proceedings of SIGCSE: Symposium on Computer Science Education, 2008.Google Scholar
- D. Judd, P. K. McKinley, and A. K. Jain. Large-scale parallel data clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):871--876, 1998. Google ScholarDigital Library
- S. L. Lauritzen. Graphical Models. Clarendon Press, 1996.Google Scholar
- D. D. Lewis, Y. Yang, T. G. Rose, and F. Li. Rcv1: A new benchmark collection for text categorization research. JMLR, 5:361--397, 2004. Google ScholarDigital Library
- K. Liu, H. Kargupta, J. Ryan, and K. Bhaduri. Distributed data mining bibliography. http://www.cs.umbc.edu/~hillol/DDMBIB, 2004.Google Scholar
- B. Long, Z. Zhang, and P. S. Yu. A probabilistic framework for relational clustering. In Proceedings of the ACM SIGKDD, pages 470--479, 2007. Google ScholarDigital Library
- A. McCallum, A. Corrada-Emmanuel, and X. Wang. Topic and role discovery in social networks. In Proceedings of IJCAI-19, pages 786--791, 2005.Google ScholarDigital Library
- A. McCallum, K. Nigam, and L. H. Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of ACM SIGKDD, pages 169--178, 2000. Google ScholarDigital Library
- R. Pensa and J.-F. Boulicaut. Constrained co-clustering of gene expression data. In Proceedings of the 2008 SIAM International Conference on Data Mining, pages 25--36, 2008.Google ScholarCross Ref
- R. Rocci and M. Vichi. Two-mode multi-partitioning. Computational Statistics and Data Analysis, 52(4), 2008. Google ScholarDigital Library
- N. Rooney, D. Patterson, M. Galushka, and V. Dobrynin. A scaleable document clustering approach for large document corpora. Information Processing and Management, 42(5):1163--1175, 2006. Google ScholarDigital Library
- N. Slonim, N. Friedman, and N. Tishby. Unsupervised document classification using sequential information maximization. In Proceedings of SIGIR-25, 2002. Google ScholarDigital Library
- N. Slonim and N. Tishby. Agglomerative information bottleneck. In Advances in Neural Information Processing Systems 12 (NIPS), pages 617--623, 2000.Google Scholar
- M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra. MPI - The Complete Reference: Volume 1, The MPI Core. MIT Press, 2nd edition, 1998. Google ScholarDigital Library
- C. Sutton and A. McCallum. Piecewise training of undirected models. In Proceedings of UAI-21, 2005.Google Scholar
- N. Tishby, F. Pereira, and W. Bialek. The information bottleneck method, 1999. Invited paper to the 37th Annual Allerton Conference on Communication, Control, and Computing.Google Scholar
- X. Xu, J. Jäger, and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery, 3(3):263--290, 1999. Google ScholarDigital Library
Index Terms
- Data weaving: scaling up the state-of-the-art in data clustering
Recommendations
Towards information-theoretic K-means clustering for image indexing
Information-theoretic K-means (Info-Kmeans) aims to cluster high-dimensional data, such as images featured by the bag-of-features (BOF) model, using K-means algorithm with KL-divergence as the distance. While research efforts along this line have shown ...
SAIL: summation-based incremental learning for information-theoretic clustering
KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data miningInformation-theoretic clustering aims to exploit information theoretic measures as the clustering criteria. A common practice on this topic is so-called INFO-K-means, which performs K-means clustering with the KL-divergence as the proximity function. ...
Gromov-Wasserstein Multi-modal Alignment and Clustering
CIKM '22: Proceedings of the 31st ACM International Conference on Information & Knowledge ManagementMulti-modal clustering aims at finding a clustering structure shared by the data of different modalities in an unsupervised way. Currently, solving this problem often relies on two assumptions: i) the multi-modal data own the same latent distribution, ...
Comments