ABSTRACT
Real-world, multiple-typed objects are often interconnected, forming heterogeneous information networks. A major challenge for link-based clustering in such networks is its potential to generate many different results, carrying rather diverse semantic meanings. In order to generate desired clustering, we propose to use meta-path, a path that connects object types via a sequence of relations, to control clustering with distinct semantics. Nevertheless, it is easier for a user to provide a few examples ("seeds") than a weighted combination of sophisticated meta-paths to specify her clustering preference. Thus, we propose to integrate meta-path selection with user-guided clustering to cluster objects in networks, where a user first provides a small set of object seeds for each cluster as guidance. Then the system learns the weights for each meta-path that are consistent with the clustering result implied by the guidance, and generates clusters under the learned weights of meta-paths. A probabilistic approach is proposed to solve the problem, and an effective and efficient iterative algorithm, PathSelClus, is proposed to learn the model, where the clustering quality and the meta-path weights are mutually enhancing each other. Our experiments with several clustering tasks in two real networks demonstrate the power of the algorithm in comparison with the baselines.
Supplemental Material
- E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing. Mixed membership stochastic blockmodels. J. Mach. Learn. Res., 9:1981--2014, June 2008. Google ScholarDigital Library
- A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with bregman divergences. J. Mach. Learn. Res., 6:1705--1749, December 2005. Google ScholarCross Ref
- A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a mahalanobis metric from equivalence constraints. JMLR, 6, 2005. Google ScholarDigital Library
- S. Basu, A. Banerjee, and R. Mooney. Semi-supervised clustering by seeding. In ICML '02, 2002. Google ScholarDigital Library
- S. Basu, M. Bilenko, and R. J. Mooney. A probabilistic framework for semi-supervised clustering. KDD '04, pages 59--68, 2004. Google ScholarDigital Library
- M. Bilenko, S. Basu, and R. J. Mooney. Integrating constraints and metric learning in semi-supervised clustering. In ICML '04, 2004. Google ScholarDigital Library
- D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In ICML '00, pages 167--174, 2000. Google ScholarDigital Library
- I. S. Dhillon, S. Mallela, and R. Kumar. A divisive information-theoretic feature clustering algorithm for text classification. JMLR, 3:1265--1287, 2003. Google ScholarDigital Library
- I. Guyon and A. Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157--1182, Mar. 2003. Google ScholarDigital Library
- T. Hofmann. Probabilistic latent semantic indexing. SIGIR '99, pages 50--57, 1999. Google ScholarDigital Library
- B. Kulis, S. Basu, I. Dhillon, and R. Mooney. Semi-supervised graph clustering: a kernel approach. ICML '05, pages 457--464, 2005. Google ScholarDigital Library
- D. D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In NIPS '00, pages 556--562, 2000.Google Scholar
- B. Long, Z. (mark Zhang, X. Wu, and P. S. Yu. Spectral clustering for multi-type relational data. In ICML '06, pages 585--592, 2006. Google ScholarDigital Library
- B. Long, Z. M. Zhang, and P. S. Yu. A probabilistic framework for relational clustering. In KDD '07, pages 470--479, 2007. Google ScholarDigital Library
- U. Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17:395--416, December 2007. Google ScholarDigital Library
- M. E. J. Newman. Fast algorithm for detecting community structure in networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 69(6), 2004.Google Scholar
- M. E. J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 69(2), 2004.Google Scholar
- K. Punera and J. Ghosh. Consensus-based ensembles of soft clusterings. Appl. Artif. Intell., 22:780--810, August 2008. Google ScholarDigital Library
- J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:888--905, 1997. Google ScholarDigital Library
- A. Strehl, J. Ghosh, and C. Cardie. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Journal of Machine Learning Research, 3:583--617, 2002. Google ScholarDigital Library
- Y. Sun, J. Han, X. Yan, P. S. Yu, and T. Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. In VLDB '11, 2011.Google ScholarDigital Library
- Y. Sun, J. Han, P. Zhao, Z. Yin, H. Cheng, and T. Wu. Rankclus: integrating clustering with ranking for heterogeneous information network analysis. In EDBT '09, pages 565--576, 2009. Google ScholarDigital Library
- Y. Sun, Y. Yu, and J. Han. Ranking-based clustering of heterogeneous information networks with star network schema. In KDD '09, pages 797--806, 2009. Google ScholarDigital Library
- F. Wang, T. Li, X. Wang, S. Zhu, and C. Ding. Community discovery using nonnegative matrix factorization. Data Mining and Knowledge Discovery, 20, 2010. Google ScholarDigital Library
- N. Wang, S. Parthasarathy, K.-L. Tan, and A. K. H. Tung. Csv: visualizing and mining cohesive subgraphs. In SIGMOD '08, pages 445--458, 2008. Google ScholarDigital Library
- X. Xu, N. Yuruk, Z. Feng, and T. A. J. Schweiger. Scan: a structural clustering algorithm for networks. In KDD '07, pages 824--833, 2007. Google ScholarDigital Library
- Z. Xu, I. King, M. R.-T. Lyu, and R. Jin. Discriminative semi-supervised feature selection via manifold regularization. Trans. Neur. Netw., 21:1033--1047, July 2010. Google ScholarDigital Library
- X. Yin, J. Han, and P. S. Yu. Crossclus: user-guided multi-relational clustering. Data Mining and Knowledge Discovery, 2007. Google ScholarDigital Library
- Z. Zhao and H. Liu. Semi-supervised feature selection via spectral analysis. In ICDM '07, 2007.Google ScholarDigital Library
- X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report Carnegie Mellon University-CALD-02-107, Carnegie Mellon University, 2002.Google Scholar
- X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-Supervised learning using gaussian fields and harmonic functions. In ICML '03, pages 912--919, 2003.Google Scholar
Index Terms
- Integrating meta-path selection with user-guided object clustering in heterogeneous information networks
Recommendations
Ranking-based clustering of heterogeneous information networks with star network schema
KDD '09: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data miningA heterogeneous information network is an information network
composed of multiple types of objects. Clustering on such a network may lead to better understanding of both hidden structures of the network and the individual role played by every object in ...
PathSelClus: Integrating Meta-Path Selection with User-Guided Object Clustering in Heterogeneous Information Networks
Special Issue on ACM SIGKDD 2012Real-world, multiple-typed objects are often interconnected, forming heterogeneous information networks. A major challenge for link-based clustering in such networks is their potential to generate many different results, carrying rather diverse semantic ...
A node clustering algorithm for heterogeneous information networks based on node embeddings
AbstractClustering is a very important method to analyze HIN. Thus, several HIN clustering algorithms have been proposed and all these algorithms are based on meta-paths. Meta-path can be used to describe the relationship of target objects. Even though ...
Comments