skip to main content
10.1145/3219819.3220085acmotherconferencesArticle/Chapter ViewAbstractPublication PageskddConference Proceedingsconference-collections
research-article

Node Similarity with q -Grams for Real-World Labeled Networks

Published:19 July 2018Publication History

ABSTRACT

We study node similarity in labeled networks, using the label sequences found in paths of bounded length q leading to the nodes. (This recalls the q-grams employed in document resemblance, based on the Jaccard distance.) When applied to networks, the challenge is two-fold: the number of q-grams generated from labeled paths grows exponentially with q, and their frequency should be taken into account: this leads to a variation of the Jaccard index known as Bray-Curtis index for multisets. We describe nSimGram, a suite of fast algorithms for node similarity with q-grams, based on a novel blend of color coding, probabilistic counting, sketches, and string algorithms, where the universe of elements to sample is exponential. We provide experimental evidence that our measure is effective and our running times scale to deal with large real-world networks.

Skip Supplemental Material Section

Supplemental Material

conte_similarity_q_grams_networks.mp4

mp4

333.5 MB

References

  1. Nesreen K Ahmed, Jennifer Neville, Ryan A Rossi, and Nick Duffield. Efficient graphlet counting for large networks. In Data Mining (ICDM), 2015 IEEE International Conference on, pages 1--10. IEEE, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alexa. Website traffic, statistics and analytics. https://www.alexa.com/siteinfo/, Accessed October 2017.Google ScholarGoogle Scholar
  3. Noga Alon and Shai Gutner. Balanced families of perfect hash functions and their applications. ACM Trans. Algorithms, 6(3):54:1--54:12, July 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Noga Alon, Raphael Yuster, and Uri Zwick. Color-coding. Journal of the ACM (JACM), 42(4):844--856, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21--23, 2000, Proceedings, pages 1--10, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. David Eppstein and Joseph Wang. Fast approximation of centrality. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 228--229. Society for Industrial and Applied Mathematics, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Yuan Fang, Wenqing Lin, Vincent W Zheng, Min Wu, Kevin Chen-Chuan Chang, and Xiao-Li Li. Semantic proximity search on graphs with metagraph-based learning. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pages 277--288. IEEE, 2016.Google ScholarGoogle ScholarCross RefCross Ref
  8. Jörg Flum and Martin Grohe. The parameterized complexity of counting problems. SIAM Journal on Computing, 33(4):892--922, August 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Michael R. Garey and David S. Johnson. Computers and intractability : a guide to the theory of NP-completeness /. San Francisco : W. H. Freeman, 1979, 338 p. CALL NUMBER: QA76.6 .G35, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P-L Giscard and RC Wilson. The all-paths and cycles graph kernel. arXiv preprint arXiv:1708.01410, 2017.Google ScholarGoogle Scholar
  11. IMDb. Imdb datasets. http://www.imdb.com/interfaces/, Accessed October 2017.Google ScholarGoogle Scholar
  12. Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In 10th International Conference on Data Mining (ICDM), pages 246--255. IEEE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538--543. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271--279. Acm, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39--43, 1953.Google ScholarGoogle ScholarCross RefCross Ref
  16. Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150 -- 1170, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  17. P. Legendre and L.F.J. Legendre. Numerical Ecology. Developments in Environmental Modelling. Elsevier Science, 1998.Google ScholarGoogle Scholar
  18. Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  19. Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in neural information processing systems, pages 539--547, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Weiping Liu and Linyuan Lü. Link prediction based on local random walk. EPL (Europhysics Letters), 89(5):58007, 2010.Google ScholarGoogle Scholar
  21. Zemin Liu, Vincent W Zheng, Zhou Zhao, Fanwei Zhu, Kevin Chen-Chuan Chang, Minghui Wu, and Jing Ying. Semantic proximity search on heterogeneous graph by proximity embedding. In AAAI, pages 154--160, 2017.Google ScholarGoogle Scholar
  22. Sascha Rothe and Hinrich Schütze. Cosimrank: A flexible & efficient graph-theoretic similarity measure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1392--1402, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  23. Chuan Shi, Xiangnan Kong, Yue Huang, S Yu Philip, and Bin Wu. Hetesim: A general framework for relevance measure in heterogeneous networks. IEEE Transactions on Knowledge and Data Engineering, 26(10):2479--2492, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  24. Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and S Yu Philip. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering, 29(1):17--37, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Chuan Shi, Chong Zhou, Xiangnan Kong, Philip S Yu, Gang Liu, and Bai Wang. Heterecom: a semantic-based recommendation system in heterogeneous networks. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1552--1555. ACM, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. SNAP. Netinf. http://snap.stanford.edu/netinf/, Accessed October 2017.Google ScholarGoogle Scholar
  27. Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992--1003, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. The Google Ngram Viewer Team, part of Google Research. Google Books Ngram Viewer. https://books.google.com/ngrams/info, Accessed February 2018.Google ScholarGoogle Scholar
  29. Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, ICDM '06, pages 613--622, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Guan Wang, Qingbo Hu, and Philip S Yu. Influence and similarity on heterogeneous networks. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1462--1466. ACM, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Wei Wu, Bin Li, Ling Chen, and Chengqi Zhang. Consistent weighted sampling made more practical. In Proceedings of the 26th International Conference on World Wide Web, pages 1035--1043. International World Wide Web Conferences Steering Committee, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Yun Xiong, Yangyong Zhu, and S Yu Philip. Top-k similarity join in heterogeneous information networks. IEEE Transactions on Knowledge and Data Engineering, 27(6):1710--1723, 2015.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Node Similarity with q -Grams for Real-World Labeled Networks

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Other conferences
            KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
            July 2018
            2925 pages
            ISBN:9781450355520
            DOI:10.1145/3219819

            Copyright © 2018 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 19 July 2018

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article

            Acceptance Rates

            KDD '18 Paper Acceptance Rate107of983submissions,11%Overall Acceptance Rate1,133of8,635submissions,13%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader