ABSTRACT
There is a growing wealth of data describing networks of various types, including social networks, physical networks such as transportation or communication networks, and biological networks. At the same time, there is a growing interest in analyzing these networks, in order to uncover general laws that govern their structure and evolution, and patterns and predictive models to develop better policies and practices. However, a fundamental challenge in dealing with this newly available observational data describing networks is that the data is often of dubious quality -- it is noisy and incomplete -- and before any analysis method can be applied, the data must be cleaned, and missing information inferred. In this paper, we introduce the notion of graph identification, which explicitly models the inference of a "cleaned" output network from a noisy input graph. It is this output network that is appropriate for further analysis. We present an illustrative example and use the example to explore the types of inferences involved in graph identification, as well as the challenges and issues involved in combining those inferences. We then present a simple, general approach to combining the inferences in graph identification and experimentally show the utility of our combined approach and how the performance of graph identification is sensitive to the inter-dependencies among these inferences.
- I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data, 1:1--36, 2007. Google ScholarDigital Library
- I. Bhattacharya, S. Godbole, and S. Joshi. Structured entity identification and document categorization: two tasks with one joint model. In Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 25--33, 2008. Google ScholarDigital Library
- M. Bilgic and L. Getoor. Effective label acquisition for collective classification. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 43--51, 2008. Google ScholarDigital Library
- M. Bilgic, G. M. Namata, and L. Getoor. Combining collective classification and link prediction. In ICDMW '07: Proceedings of the Seventh IEEE International Conference on Data Mining Workshops, pages 381--386, Washington, DC, USA, 2007. IEEE Computer Society. Google ScholarDigital Library
- Y. Choi, E. Breck, and C. Cardie. Joint extraction of entities and relations for opinion recognition. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 431--439. Association for Computational Linguistics, 2006. Google ScholarDigital Library
- W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proceedings of IJCAI-03 Workshop on Information Integration, pages 73--78, August 2003.Google Scholar
- L. Getoor and C. P. Diehl. Link mining: a survey. SIGKDD Explorations Newsletter, 7:3--12, 2005. Google ScholarDigital Library
- L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Machine Learning, 3:679--707, 2003. Google ScholarDigital Library
- J. Leskovec, J. Kleinberg, and C. Faloutsos. Graph evolution: Densification and shrinking diameters. ACM Transactions on Knowledge Discovery Data, 1(1):2, 2007. Google ScholarDigital Library
- D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In Intl. Conf. on Information and Knowledge Management, 2003. Google ScholarDigital Library
- Q. Lu and L. Getoor. Link-based classification. In Proceedings of the International Conference on Machine Learning, 2003.Google Scholar
- L. McDowell, K. M. Gupta, and D. W. Aha. Cautious inference in collective classification. In AAAI, pages 596--601, 2007. Google ScholarDigital Library
- G. M. Namata, P. Sen, M. Bilgic, and L. Getoor. Collective classification for text classification. In M. Sahami and A. Srivastava, editors, Text Mining: Classification, Clustering, and Applications. Taylor and Francis Group, 2009.Google Scholar
- M. J. Rattigan, M. Maier, and D. Jensen. Exploiting network structure for active inference in collective classification. Technical report, University of Massachusetts Amherst, 2007.Google Scholar
- M. Richardson and P. Domingos. Markov logic networks. Machine Learning, 62:107--136, 2006. Google ScholarDigital Library
- D. Roth, K. Small, and I. Titov. Sequential learning of classifiers for structured prediction problems. In Artificial Intelligence STAT, April 2009.Google Scholar
- D. Roth and W. Yih. A linear programming formulation for global inference in natural language tasks. In Proc. of CoNLL-2004, pages 1--8. Boston, MA, USA, 2004.Google Scholar
- B. Taskar, A. Pieter, and D. Koller. Discriminative probabilistic models for relational data. In Conf. on Uncertainty in Artificial Intelligence, 2002. Google ScholarDigital Library
- B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Advances in Neural Information Processing Systems, 2003.Google Scholar
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, October 1999. Google ScholarDigital Library
Index Terms
- Identifying graphs from noisy and incomplete data
Recommendations
Identifying graphs from noisy and incomplete data
There is a growing wealth of data describing networks of various types, including social networks, physical networks such as transportation or communication networks, and biological networks. At the same time, there is a growing interest in analyzing ...
Collective graph identification
KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data miningData describing networks (communication networks, transaction networks, disease transmission networks, collaboration networks, etc.) is becoming increasingly ubiquitous. While this observational data is useful, it often only hints at the actual ...
Collective Graph Identification
Data describing networks—such as communication networks, transaction networks, disease transmission networks, collaboration networks, etc.—are becoming increasingly available. While observational data can be useful, it often only hints at the actual ...
Comments