skip to main content
article

Link mining: a survey

Published: 01 December 2005 Publication History

Abstract

Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other semantic information). Examples of homogeneous networks include single mode social networks, such as people connected by friendship links, or the WWW, a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and contacts, or in bibliographic domains describing publications, authors, and venues. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a cross-fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this article, we review some of the common emerging themes.

References

[1]
J. Adibi, H. Chalupsky, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. KDD Workshop on Link Analysis and Group Detection. 2004.]]
[2]
J. Adibi, H. Chalupsky, E. Melz, and A. Valente. The KOJAK group finder: Connecting the dots via integrated knowledge-based and statistical reasoning. In Innovative Applications of Artificial Intelligence Conference, 2004.]]
[3]
J. Adibi, M. Grobelnik, D. Mladenic, and P. Pantel. KDD Workshop on Link Discovery: Issues, Approaches and Applications. 2005.]]
[4]
R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In International Conference on Very Large Data Bases, pages 487--499, Sept. 1994.]]
[5]
E. M. Airoldi and K. M. Carley. Sampling algorithms for pure network topologies. SIGKDD Explorations, 7(2), 2005.]]
[6]
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In International Conference on Very Large Databases (VLDB), Hong Kong, China, 2002.]]
[7]
A. L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.]]
[8]
K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 104--111, 1998.]]
[9]
I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge Discovery, June 2004.]]
[10]
I. Bhattacharya and L. Getoor. Entity resolution in graphs. Technical Report 4758, Computer Science Department, University of Maryland, 2005.]]
[11]
I. Bhattacharya and L. Getoor. A Latent dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, 2006. To Appear.]]
[12]
P. Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5):1170--1182, 1987.]]
[13]
S. P. Borgatti and M. G. Everett. Models of core / periphery structures. Social Networks, 21:375--395, 1999.]]
[14]
P. J. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press, Cambridge, 2005.]]
[15]
D. Chakrabarti. Tools for Large Graph Mining. PhD thesis, School of Computer Science, Carnegie Mellon University, 2005.]]
[16]
S. Chakrabarti. Mining the Web. Morgan Kaufman, 2002.]]
[17]
S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In International World Wide Web Conference (WWW), 1998.]]
[18]
S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD International Conference on Management of Data, pages 307--318, 1998.]]
[19]
R. Chellappa and A. Jain. Markov random fields: theory and applications. Academic Press, Boston, 1993.]]
[20]
D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In International Conference on Machine Learning (ICML), pages 167--174. Morgan Kaufmann, San Francisco, CA, 2000.]]
[21]
D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Neural Information Processing Systems 13, 2001.]]
[22]
D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231--255, 1994.]]
[23]
D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32--41, 2000.]]
[24]
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the world wide web. Artificial Intelligence, 118(1-2):69--114, 2000.]]
[25]
A. Culotta and A. McCallum. Joint deduplication of multiple record types in relational data. In Fourteenth Conference on Information and Knowledge Management (CIKM), 2005.]]
[26]
G. V. Cybenko and J. Srivastava. SIAM Workshop on Link Analysis, Counterterrorism and Privacy. 2004.]]
[27]
L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds. In International Conference on Knowledge Discovery and Data Mining, pages 30--36, 1998.]]
[28]
T. Dietterich, L. Getoor, and K. Murphy. ICML Workshop on Statistical Relational Learning and its Connections to Other Fields. 2004.]]
[29]
C. Ding, X. He, P. Husbands, H. Zha, and H. D. Simon. PageRank, HITS and a unified framework for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 353--354, 2002.]]
[30]
C. H. Q. Ding. Spectral clustering, 2004. http://crd.lbl.gov/cding/Spectral/.]]
[31]
A. Doan, P. Domingos, and A. Y. Halevy. Learning to match the schemas of data sources: A multistrategy approach. Machine Learning, 50(3), 2003.]]
[32]
A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map between ontologies on the semantic web. In International World Wide Web Conference, 2002.]]
[33]
P. Domingos and M. Richardson. Markov logic: A unifying framework for statistical relational learning. In ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields, pages 49--54, 2004.]]
[34]
X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In ACM SIGMOD International Conference on Management of Data, pages 85--96, 2005.]]
[35]
S. Donoho, T. Dybala, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. KDD Workshop on Link Analysis for Detecting Complex Behavior. 2003.]]
[36]
S. Dzeroski and H. Blockeel. KDD Workshop on Multi-Relational Data Mining. 2004.]]
[37]
S. Dzeroski and H. Blockeel. KDD Workshop on Multi-Relational Data Mining. 2005.]]
[38]
S. Dzeroski and N. Lavrac, editors. Relational Data Mining. Kluwer, Berlin, 2001.]]
[39]
S. Dzeroski, L. D. Raedt, and S. Wrobel. KDD Workshop on Multi-Relational Data Mining. 2003.]]
[40]
R. Feldman. Link analysis: Current state of the art, 2002.]]
[41]
O. Frank and K. Nowicki. Exploratory statistical analysis of networks. Annals of Discrete Mathematics, 55:349--366, 1993.]]
[42]
T. Frantz and K. M. Carley. A formal characterization of cellular networks. Technical Report CMU-ISRI-05-109, Carnegie Mellon University, 2005.]]
[43]
L. Freeman. Centrality in social networks: Conceptual clarifications. Social Networks, 1:215--239, 1979.]]
[44]
T. Gärtner. Exponential and geometric kernels for graphs. In NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data, 2002.]]
[45]
T. Gärtner. A survey of kernels for structured data. SIGKDD Explorations, 5(1):49--58, 2003.]]
[46]
L. Getoor. Link mining: a new data mining challenge. SIGKDD Explorations, 5(1):84--89, 2003.]]
[47]
L. Getoor. N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679--707, 2003.]]
[48]
L. Getoor and D. Jensen. AAAI Workshop on Learning Statistical Models from Relational Data. AAAI Press, 2000.]]
[49]
L. Getoor and D. Jensen. IJCAI Workshop on Learning Statistical Models from Relational Data. 2003.]]
[50]
D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In ACM Conference on Hypertext and Hypermedia, pages 225--234, 1998.]]
[51]
T. H. Haveliwala. Topic-sensitive PageRank. In International Conference on the World Wide Web (WWW), pages 517--526, 2002.]]
[52]
M. Huisman and T. A. B. Snijders. Statistical analysis of longitudinal network data with changing composition. Sociological Methods and Research, 32:253--287, 2003.]]
[53]
R. Hummel and S. Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 267--287, 1983.]]
[54]
A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based algorithm for mining frequent substructures from graph data. In European Conference on Principles and Practice of Knowledge Discovery and Data Mining, pages 13--23, 2000.]]
[55]
G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538--543, 2002.]]
[56]
G. Jeh and J. Widom. Scaling personalized web search. In International Conference on the World Wide Web (WWW), pages 271--279, 2003.]]
[57]
D. Jensen. Statistical challenges to inductive inference in linked data. In Seventh International Workshop on Artificial Intelligence and Statistics, 1999.]]
[58]
D. Jensen and H. Goldberg. AAAI Fall Symposium on AI and Link Analysis. AAAI Press, 1998.]]
[59]
D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SIAM International Conference on Data Mining, April 21--23 2005.]]
[60]
H. Kashima and A. Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, 2002.]]
[61]
C. Kemp, T. L. Griffiths, and J. B. Tenenbaum. Discovering latent classes in relational data. Technical Report AI Memo 2004-019, Massachusetts Institute of Technology, September 2004.]]
[62]
N. Ketkar, L. Holder, and D. Cook. Comparison of graph-based and logic-based multi-relational data mining. SIGKDD Explorations, 7(2), December 2005.]]
[63]
R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. E. Sternberg. Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. National Academy of Sciences, 93(1):438--442, January 1996.]]
[64]
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.]]
[65]
A. Knobbe and D. van der Wallen. ECML/PKDD Workshop on Multi-Relational Data Mining. 2001.]]
[66]
J. N. Kok and T. Washio. ECML/PKDD Workshop on Mining Graphs, Trees and Sequences. 2004.]]
[67]
J. Kubica, A. Moore, D. Cohn, and J. Schneider. eGraph: A fast graph-based method for link analysis and queries. In IJCAI 2003 Text-Mining and Link-Analysis Workshop, August 2003.]]
[68]
J. Kubica, A. Moore, and J. Schneider. Tractable group detection on large link data sets. In The Third IEEE International Conference on Data Mining, pages 573--576, 2003.]]
[69]
J. Kubica, A. Moore, J. Schneider, and Y. Yang. Stochastic link and group detection. In Eighteenth National Conference on Artificial Intelligence, pages 798--804. American Association for Artificial Intelligence, 2002.]]
[70]
M. Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE International Conference on Data Mining, pages 313--320, 2001.]]
[71]
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML-01, 2001.]]
[72]
N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.]]
[73]
R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1-6):387--401, 2000.]]
[74]
X. Li, P. Morie, and D. Roth. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special Issue on Semantic Integration, 2005.]]
[75]
D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In International Conference on Information and Knowledge Management (CIKM), pages 556--559, 2003.]]
[76]
Q. Lu and L. Getoor. Link-based classification. In International Conference on Machine Learning, 2003.]]
[77]
A. Madche and S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2):72--79, March/April 2001.]]
[78]
T. Matsuda, T. Horiuchi, H. Motoda, and T. Washio. Extension of graph-based induction for general graph structured data. In PAKDD, pages 420--431, 2000.]]
[79]
S. Muggleton, editor. Inductive Logic Programming. Academic Press, 1992.]]
[80]
J. Neville and D. Jensen. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data. AAAI Press, 2000.]]
[81]
J. Neville and D. Jensen. Leveraging relational auto-correlation with latent group models. In IEEE International Conference on Data Mining (ICDM), 2005.]]
[82]
M. E. J. Newman. Detecting community structure in networks. European Physical Journal B, 38:321--330, 2004.]]
[83]
A. Y. Ng, A. X. Zheng, and M. I. Jordan. Link analysis, eigenvectors and stability. In International Joint Conference on Artificial Intelligence (IJCAI), pages 903--910, 2001.]]
[84]
A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.]]
[85]
S. Nijssen, T. Meinl, and G. Karypis. ECML/PKDD Workshop on Mining Graphs, Trees and Sequences. 2005.]]
[86]
K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077--1087, 2001.]]
[87]
H.-J. Oh, S. H. Myaeng, and M.-H. Lee. A practical hypertext catergorization method using links and incrementally available class information. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 264--271, 2000.]]
[88]
J. O'Madadhain, J. Hutchins, and P. Smyth. Prediction and ranking algorithms for even-based network data. SIGKDD Explorations, 7(2), December 2005.]]
[89]
J. O'Madadhain and P. Smyth. EventRank: A framework for ranking time-varying networks. In KDD Workshop on Link Discovery (LinkKDD): Issues, Approaches and Applications, 2005.]]
[90]
J. O'Madadhain, P. Smyth, and L. Adamic. Learning predictive models for link formation. Presented at the International Sunbelt Social Network Conference, February, 2005.]]
[91]
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.]]
[92]
H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.]]
[93]
A. Popescul and L. H. Ungar. Statistical relational learning for link prediction. In IJCAI Workshop on Learning Statistical Models from Relational Data, 2003.]]
[94]
L. D. Raedt and T. Washio. ECML/PKDD Workshop on Mining Graphs, Trees and Sequences. 2003.]]
[95]
D. Rafiei and A. O. Mendelzon. What is this page known for? Computing web page reputations. In International World Wide Web Conference (WWW), pages 823--835. North-Holland Publishing Co., 2000.]]
[96]
C. Ramakrishnan, W. Milnor, M. Perry, and A. Sheth. Discovering informative connection subgraphs in multi-relational graphs. SIGKDD Explorations, 7(2), December 2005.]]
[97]
M. Rattigan and D. Jensen. The case for anomalous link discovery. SIGKDD Explorations, 7(2), December 2005.]]
[98]
M. Richardson and P. Domingos. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.]]
[99]
A. Rosenfeld, R. Hummel, and S. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, 6(6), 1976.]]
[100]
T. Senator. Link mining applications: Progress and challenges. SIGKDD Explorations, 7(2), 2005.]]
[101]
L. Singh, L. Getoor, and L. Licamele. Pruning social networks using structural properties and descriptive attributes. In International Conference on Data Mining, 2005.]]
[102]
P. Singla and P. Domingos. Multi-relational record linkage. In KDD Workshop on Multi-Relational Data Mining, Seattle, WA, August 2004.]]
[103]
D. Skillicorn and K. Carley. SIAM Workshop on Link Analysis, Counterterrorism and Security. 2005.]]
[104]
A. Srivastava, D. Barbara, H. Kargupta, and V. Kumar. SIAM Workshop on Data Mining for Counterterrorism and Security. 2003.]]
[105]
J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Relevance search and anomaly detection in bipartite graphs. SIGKDD Explorations, 7(2), December 2005.]]
[106]
L. Sweeney. Privacy-enhanced linking. SIGKDD Explorations, 7(2), 2005.]]
[107]
B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. of UAI, pages 485--492, Edmonton, Canada, 2002.]]
[108]
B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural Information Processing Systems Conference, Vancouver, Canada, December 2003.]]
[109]
J. R. Tyler, D. M. Wilkinson, and B. A. Huberman. Email as Spectroscopy: Automated Discovery of Community Structure within Organizations. Kluwer, B. V., Deventer, The Netherlands, The Netherlands, 2003.]]
[110]
X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations and text. In KDD Workshop on Link Discovery, August 2005.]]
[111]
S. Wasserman. Conformity of two sociometric relations. Psychometrika, 52:3--18, 1987.]]
[112]
S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, 1994.]]
[113]
D. J. Watts and S. H. Strogatz. Collective dynamics of "small-world" networks. Nature, 393:440--442, 1998.]]
[114]
S. White and P. Smyth. Algorithms for estimating relative importance in networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 266--275, 2003.]]
[115]
A. P. Wolfe and D. Jensen. Playing multiple roles: Discovering overlapping roles in social networks. In ICML-04 Workshop on Statistical Relational Learning and its Connections to Other Fields, 2004.]]
[116]
X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In International Conference on Data Mining, 2002.]]
[117]
K. Yoshida, H. Motoda, and N. Indurkhya. Graph-based induction as a unified learning framework. Journal of Applied Intelligence, 4(3):297--316, July 1994.]]

Cited By

View all
  • (2025)Struct-KGS2S: a structural context-based sequence-to-sequence model for link prediction in knowledge graphsKnowledge and Information Systems10.1007/s10115-025-02364-yOnline publication date: 27-Feb-2025
  • (2024)A review on network representation learning with multi-granularity perspectiveIntelligent Data Analysis10.3233/IDA-22732828:1(3-32)Online publication date: 1-Jan-2024
  • (2024)BERT4FCA: A method for bipartite link prediction using formal concept analysis and BERTPLOS ONE10.1371/journal.pone.030485819:6(e0304858)Online publication date: 5-Jun-2024
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM SIGKDD Explorations Newsletter
ACM SIGKDD Explorations Newsletter  Volume 7, Issue 2
December 2005
152 pages
ISSN:1931-0145
EISSN:1931-0153
DOI:10.1145/1117454
Issue’s Table of Contents

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 December 2005
Published in SIGKDD Volume 7, Issue 2

Check for updates

Qualifiers

  • Article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)83
  • Downloads (Last 6 weeks)11
Reflects downloads up to 02 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2025)Struct-KGS2S: a structural context-based sequence-to-sequence model for link prediction in knowledge graphsKnowledge and Information Systems10.1007/s10115-025-02364-yOnline publication date: 27-Feb-2025
  • (2024)A review on network representation learning with multi-granularity perspectiveIntelligent Data Analysis10.3233/IDA-22732828:1(3-32)Online publication date: 1-Jan-2024
  • (2024)BERT4FCA: A method for bipartite link prediction using formal concept analysis and BERTPLOS ONE10.1371/journal.pone.030485819:6(e0304858)Online publication date: 5-Jun-2024
  • (2024)Predicting Higher Order Links in Social Interaction NetworksIEEE Transactions on Computational Social Systems10.1109/TCSS.2023.329307511:2(2796-2806)Online publication date: Apr-2024
  • (2024)Fast Distributed Memory Parallel Algorithms for Finding Connected Components in Large Graphs2024 IEEE International Conference on Big Data (BigData)10.1109/BigData62323.2024.10825675(579-588)Online publication date: 15-Dec-2024
  • (2024)Line Graph is the Key: An Exploration of Line Graph on Link Prediction with Social Networks2024 11th International Conference on Behavioural and Social Computing (BESC)10.1109/BESC64747.2024.10780611(1-7)Online publication date: 16-Aug-2024
  • (2024)An Overview of Similarity-Based Methods in Predicting Social Network Links: A Comparative AnalysisIEEE Access10.1109/ACCESS.2024.345050612(120913-120934)Online publication date: 2024
  • (2024)Extending Graph-Based LP Techniques for Enhanced Insights Into Complex Hypergraph NetworksIEEE Access10.1109/ACCESS.2024.338532012(51208-51222)Online publication date: 2024
  • (2024)scMGATGRN: a multiview graph attention network–based method for inferring gene regulatory networks from single-cell transcriptomic dataBriefings in Bioinformatics10.1093/bib/bbae52625:6Online publication date: 17-Oct-2024
  • (2024)Graph neural link predictor based on cycle structureCAAI Transactions on Intelligence Technology10.1049/cit2.12396Online publication date: 12-Nov-2024
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media