ABSTRACT
We study node similarity in labeled networks, using the label sequences found in paths of bounded length q leading to the nodes. (This recalls the q-grams employed in document resemblance, based on the Jaccard distance.) When applied to networks, the challenge is two-fold: the number of q-grams generated from labeled paths grows exponentially with q, and their frequency should be taken into account: this leads to a variation of the Jaccard index known as Bray-Curtis index for multisets. We describe nSimGram, a suite of fast algorithms for node similarity with q-grams, based on a novel blend of color coding, probabilistic counting, sketches, and string algorithms, where the universe of elements to sample is exponential. We provide experimental evidence that our measure is effective and our running times scale to deal with large real-world networks.
Supplemental Material
- Nesreen K Ahmed, Jennifer Neville, Ryan A Rossi, and Nick Duffield. Efficient graphlet counting for large networks. In Data Mining (ICDM), 2015 IEEE International Conference on, pages 1--10. IEEE, 2015. Google ScholarDigital Library
- Alexa. Website traffic, statistics and analytics. https://www.alexa.com/siteinfo/, Accessed October 2017.Google Scholar
- Noga Alon and Shai Gutner. Balanced families of perfect hash functions and their applications. ACM Trans. Algorithms, 6(3):54:1--54:12, July 2010. Google ScholarDigital Library
- Noga Alon, Raphael Yuster, and Uri Zwick. Color-coding. Journal of the ACM (JACM), 42(4):844--856, 1995. Google ScholarDigital Library
- Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21--23, 2000, Proceedings, pages 1--10, 2000. Google ScholarDigital Library
- David Eppstein and Joseph Wang. Fast approximation of centrality. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 228--229. Society for Industrial and Applied Mathematics, 2001. Google ScholarDigital Library
- Yuan Fang, Wenqing Lin, Vincent W Zheng, Min Wu, Kevin Chen-Chuan Chang, and Xiao-Li Li. Semantic proximity search on graphs with metagraph-based learning. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pages 277--288. IEEE, 2016.Google ScholarCross Ref
- Jörg Flum and Martin Grohe. The parameterized complexity of counting problems. SIAM Journal on Computing, 33(4):892--922, August 2004. Google ScholarDigital Library
- Michael R. Garey and David S. Johnson. Computers and intractability : a guide to the theory of NP-completeness /. San Francisco : W. H. Freeman, 1979, 338 p. CALL NUMBER: QA76.6 .G35, 1979. Google ScholarDigital Library
- P-L Giscard and RC Wilson. The all-paths and cycles graph kernel. arXiv preprint arXiv:1708.01410, 2017.Google Scholar
- IMDb. Imdb datasets. http://www.imdb.com/interfaces/, Accessed October 2017.Google Scholar
- Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In 10th International Conference on Data Mining (ICDM), pages 246--255. IEEE, 2010. Google ScholarDigital Library
- Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538--543. ACM, 2002. Google ScholarDigital Library
- Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271--279. Acm, 2003. Google ScholarDigital Library
- Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39--43, 1953.Google ScholarCross Ref
- Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150 -- 1170, 2011.Google ScholarCross Ref
- P. Legendre and L.F.J. Legendre. Numerical Ecology. Developments in Environmental Modelling. Elsevier Science, 1998.Google Scholar
- Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.Google ScholarCross Ref
- Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in neural information processing systems, pages 539--547, 2012. Google ScholarDigital Library
- Weiping Liu and Linyuan Lü. Link prediction based on local random walk. EPL (Europhysics Letters), 89(5):58007, 2010.Google Scholar
- Zemin Liu, Vincent W Zheng, Zhou Zhao, Fanwei Zhu, Kevin Chen-Chuan Chang, Minghui Wu, and Jing Ying. Semantic proximity search on heterogeneous graph by proximity embedding. In AAAI, pages 154--160, 2017.Google Scholar
- Sascha Rothe and Hinrich Schütze. Cosimrank: A flexible & efficient graph-theoretic similarity measure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1392--1402, 2014.Google ScholarCross Ref
- Chuan Shi, Xiangnan Kong, Yue Huang, S Yu Philip, and Bin Wu. Hetesim: A general framework for relevance measure in heterogeneous networks. IEEE Transactions on Knowledge and Data Engineering, 26(10):2479--2492, 2014.Google ScholarCross Ref
- Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and S Yu Philip. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering, 29(1):17--37, 2017. Google ScholarDigital Library
- Chuan Shi, Chong Zhou, Xiangnan Kong, Philip S Yu, Gang Liu, and Bai Wang. Heterecom: a semantic-based recommendation system in heterogeneous networks. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1552--1555. ACM, 2012. Google ScholarDigital Library
- SNAP. Netinf. http://snap.stanford.edu/netinf/, Accessed October 2017.Google Scholar
- Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992--1003, 2011.Google ScholarDigital Library
- The Google Ngram Viewer Team, part of Google Research. Google Books Ngram Viewer. https://books.google.com/ngrams/info, Accessed February 2018.Google Scholar
- Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, ICDM '06, pages 613--622, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
- Guan Wang, Qingbo Hu, and Philip S Yu. Influence and similarity on heterogeneous networks. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1462--1466. ACM, 2012. Google ScholarDigital Library
- Wei Wu, Bin Li, Ling Chen, and Chengqi Zhang. Consistent weighted sampling made more practical. In Proceedings of the 26th International Conference on World Wide Web, pages 1035--1043. International World Wide Web Conferences Steering Committee, 2017. Google ScholarDigital Library
- Yun Xiong, Yangyong Zhu, and S Yu Philip. Top-k similarity join in heterogeneous information networks. IEEE Transactions on Knowledge and Data Engineering, 27(6):1710--1723, 2015.Google ScholarDigital Library
Index Terms
- Node Similarity with q -Grams for Real-World Labeled Networks
Recommendations
LSimRank: Node Similarity in a Labeled Graph
Web and Big DataAbstractThe notion of node similarity is useful in many real-world applications. Many existing similarity measurements such as SimRank and its variants have been proposed. Among these measurements, most capture the structural information of a graph only, ...
Graph Node Clustering via Transitive Node Similarity
PCI '10: Proceedings of the 2010 14th Panhellenic Conference on InformaticsThis paper studies the problem of cluster detection in undirected graphs by using transitive node similarity methods. Well-defined semi-metric measures are proposed to compute the similarity between the nodes of the graph, and the clustering is based on ...
Measuring in-network node similarity based on neighborhoods: a unified parametric approach
In many applications, we need to measure similarity between nodes in a large network based on features of their neighborhoods. Although in-network node similarity based on proximity has been well investigated, surprisingly, measuring in-network node ...
Comments