research-article

Node Similarity with q -Grams for Real-World Labeled Networks

Authors:
Alessio Conte

National Institute of Informatics, Tokyo, Japan

National Institute of Informatics, Tokyo, Japan
View Profile

,
Gaspare Ferraro

University of Pisa, Pisa, Italy

University of Pisa, Pisa, Italy
View Profile

,
Roberto Grossi

University of Pisa, Pisa, Italy

University of Pisa, Pisa, Italy
View Profile

,
Andrea Marino

University of Pisa, Pisa, Italy

University of Pisa, Pisa, Italy
View Profile

,
Kunihiko Sadakane

University of Tokyo, Tokyo, Japan

University of Tokyo, Tokyo, Japan
View Profile

,
Takeaki Uno

National Institute of Informatics, Tokyo, Japan

National Institute of Informatics, Tokyo, Japan
View Profile

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data MiningJuly 2018Pages 1282–1291https://doi.org/10.1145/3219819.3220085

Published:19 July 2018Publication History

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

Pages 1282–1291

ABSTRACT

We study node similarity in labeled networks, using the label sequences found in paths of bounded length q leading to the nodes. (This recalls the q-grams employed in document resemblance, based on the Jaccard distance.) When applied to networks, the challenge is two-fold: the number of q-grams generated from labeled paths grows exponentially with q, and their frequency should be taken into account: this leads to a variation of the Jaccard index known as Bray-Curtis index for multisets. We describe nSimGram, a suite of fast algorithms for node similarity with q-grams, based on a novel blend of color coding, probabilistic counting, sketches, and string algorithms, where the universe of elements to sample is exponential. We provide experimental evidence that our measure is effective and our running times scale to deal with large real-world networks.

Supplemental Material

conte_similarity_q_grams_networks.mp4

mp4

333.5 MB

Download

References

Nesreen K Ahmed, Jennifer Neville, Ryan A Rossi, and Nick Duffield. Efficient graphlet counting for large networks. In Data Mining (ICDM), 2015 IEEE International Conference on, pages 1--10. IEEE, 2015. Google ScholarDigital Library
Alexa. Website traffic, statistics and analytics. https://www.alexa.com/siteinfo/, Accessed October 2017.Google Scholar
Noga Alon and Shai Gutner. Balanced families of perfect hash functions and their applications. ACM Trans. Algorithms, 6(3):54:1--54:12, July 2010. Google ScholarDigital Library
Noga Alon, Raphael Yuster, and Uri Zwick. Color-coding. Journal of the ACM (JACM), 42(4):844--856, 1995. Google ScholarDigital Library
Andrei Z. Broder. Identifying and filtering near-duplicate documents. In Combinatorial Pattern Matching, 11th Annual Symposium, CPM 2000, Montreal, Canada, June 21--23, 2000, Proceedings, pages 1--10, 2000. Google ScholarDigital Library
David Eppstein and Joseph Wang. Fast approximation of centrality. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 228--229. Society for Industrial and Applied Mathematics, 2001. Google ScholarDigital Library
Yuan Fang, Wenqing Lin, Vincent W Zheng, Min Wu, Kevin Chen-Chuan Chang, and Xiao-Li Li. Semantic proximity search on graphs with metagraph-based learning. In Data Engineering (ICDE), 2016 IEEE 32nd International Conference on, pages 277--288. IEEE, 2016.Google ScholarCross Ref
Jörg Flum and Martin Grohe. The parameterized complexity of counting problems. SIAM Journal on Computing, 33(4):892--922, August 2004. Google ScholarDigital Library
Michael R. Garey and David S. Johnson. Computers and intractability : a guide to the theory of NP-completeness /. San Francisco : W. H. Freeman, 1979, 338 p. CALL NUMBER: QA76.6 .G35, 1979. Google ScholarDigital Library
P-L Giscard and RC Wilson. The all-paths and cycles graph kernel. arXiv preprint arXiv:1708.01410, 2017.Google Scholar
IMDb. Imdb datasets. http://www.imdb.com/interfaces/, Accessed October 2017.Google Scholar
Sergey Ioffe. Improved consistent sampling, weighted minhash and l1 sketching. In 10th International Conference on Data Mining (ICDM), pages 246--255. IEEE, 2010. Google ScholarDigital Library
Glen Jeh and Jennifer Widom. Simrank: a measure of structural-context similarity. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538--543. ACM, 2002. Google ScholarDigital Library
Glen Jeh and Jennifer Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271--279. Acm, 2003. Google ScholarDigital Library
Leo Katz. A new status index derived from sociometric analysis. Psychometrika, 18(1):39--43, 1953.Google ScholarCross Ref
Linyuan Lü and Tao Zhou. Link prediction in complex networks: A survey. Physica A: Statistical Mechanics and its Applications, 390(6):1150 -- 1170, 2011.Google ScholarCross Ref
P. Legendre and L.F.J. Legendre. Numerical Ecology. Developments in Environmental Modelling. Elsevier Science, 1998.Google Scholar
Elizabeth A Leicht, Petter Holme, and Mark EJ Newman. Vertex similarity in networks. Physical Review E, 73(2):026120, 2006.Google ScholarCross Ref
Jure Leskovec and Julian J Mcauley. Learning to discover social circles in ego networks. In Advances in neural information processing systems, pages 539--547, 2012. Google ScholarDigital Library
Weiping Liu and Linyuan Lü. Link prediction based on local random walk. EPL (Europhysics Letters), 89(5):58007, 2010.Google Scholar
Zemin Liu, Vincent W Zheng, Zhou Zhao, Fanwei Zhu, Kevin Chen-Chuan Chang, Minghui Wu, and Jing Ying. Semantic proximity search on heterogeneous graph by proximity embedding. In AAAI, pages 154--160, 2017.Google Scholar
Sascha Rothe and Hinrich Schütze. Cosimrank: A flexible & efficient graph-theoretic similarity measure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, volume 1, pages 1392--1402, 2014.Google ScholarCross Ref
Chuan Shi, Xiangnan Kong, Yue Huang, S Yu Philip, and Bin Wu. Hetesim: A general framework for relevance measure in heterogeneous networks. IEEE Transactions on Knowledge and Data Engineering, 26(10):2479--2492, 2014.Google ScholarCross Ref
Chuan Shi, Yitong Li, Jiawei Zhang, Yizhou Sun, and S Yu Philip. A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge and Data Engineering, 29(1):17--37, 2017. Google ScholarDigital Library
Chuan Shi, Chong Zhou, Xiangnan Kong, Philip S Yu, Gang Liu, and Bai Wang. Heterecom: a semantic-based recommendation system in heterogeneous networks. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1552--1555. ACM, 2012. Google ScholarDigital Library
SNAP. Netinf. http://snap.stanford.edu/netinf/, Accessed October 2017.Google Scholar
Yizhou Sun, Jiawei Han, Xifeng Yan, Philip S Yu, and Tianyi Wu. Pathsim: Meta path-based top-k similarity search in heterogeneous information networks. Proceedings of the VLDB Endowment, 4(11):992--1003, 2011.Google ScholarDigital Library
The Google Ngram Viewer Team, part of Google Research. Google Books Ngram Viewer. https://books.google.com/ngrams/info, Accessed February 2018.Google Scholar
Hanghang Tong, Christos Faloutsos, and Jia-Yu Pan. Fast random walk with restart and its applications. In Proceedings of the Sixth International Conference on Data Mining, ICDM '06, pages 613--622, Washington, DC, USA, 2006. IEEE Computer Society. Google ScholarDigital Library
Guan Wang, Qingbo Hu, and Philip S Yu. Influence and similarity on heterogeneous networks. In Proceedings of the 21st ACM international conference on Information and knowledge management, pages 1462--1466. ACM, 2012. Google ScholarDigital Library
Wei Wu, Bin Li, Ling Chen, and Chengqi Zhang. Consistent weighted sampling made more practical. In Proceedings of the 26th International Conference on World Wide Web, pages 1035--1043. International World Wide Web Conferences Steering Committee, 2017. Google ScholarDigital Library
Yun Xiong, Yangyong Zhu, and S Yu Philip. Top-k similarity join in heterogeneous information networks. IEEE Transactions on Knowledge and Data Engineering, 27(6):1710--1723, 2015.Google ScholarDigital Library

Index Terms

Node Similarity with q -Grams for Real-World Labeled Networks
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
      1. Similarity measures
  2. World Wide Web
    1. Web applications
      1. Social networks
2. Mathematics of computing
  1. Discrete mathematics
    1. Graph theory
      1. Approximation algorithms
      2. Graph algorithms

Recommendations

LSimRank: Node Similarity in a Labeled Graph
Web and Big Data
Abstract
The notion of node similarity is useful in many real-world applications. Many existing similarity measurements such as SimRank and its variants have been proposed. Among these measurements, most capture the structural information of a graph only, ...
Read More
Graph Node Clustering via Transitive Node Similarity
PCI '10: Proceedings of the 2010 14th Panhellenic Conference on Informatics

This paper studies the problem of cluster detection in undirected graphs by using transitive node similarity methods. Well-defined semi-metric measures are proposed to compute the similarity between the nodes of the graph, and the clustering is based on ...
Read More
Measuring in-network node similarity based on neighborhoods: a unified parametric approach

In many applications, we need to measure similarity between nodes in a large network based on features of their neighborhoods. Although in-network node similarity based on proximity has been well investigated, surprisingly, measuring in-network node ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining
July 2018
2925 pages
ISBN:9781450355520
DOI:10.1145/3219819
General Chairs:
Yike Guo
Imperial College London
,
Faisal Farooq
IBM
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 19 July 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
approximation
color coding
graph algorithms
node similarity
Qualifiers
- research-article
Conference

Acceptance Rates
KDD '18 Paper Acceptance Rate107of983submissions,11%Overall Acceptance Rate1,133of8,635submissions,13%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 8
  Total Citations
  View Citations
- 749
  Total Downloads
- Downloads (Last 12 months)15
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Node Similarity with q -Grams for Real-World Labeled Networks

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

LSimRank: Node Similarity in a Labeled Graph

Graph Node Clustering via Transitive Node Similarity

Measuring in-network node similarity based on neighborhoods: a unified parametric approach

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Node Similarity with q -Grams for Real-World Labeled Networks

KDD '18: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining

ABSTRACT

Supplemental Material

References

Cited By

Index Terms

Recommendations

LSimRank: Node Similarity in a Labeled Graph

Graph Node Clustering via Transitive Node Similarity

Measuring in-network node similarity based on neighborhoods: a unified parametric approach

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media