DOI: 10.1145/1401890.1402008
research-article

ArnetMiner: extraction and mining of academic social networks

Published: 24 August 2008

ABSTRACT

This paper addresses several key issues in the ArnetMiner system, which aims at extracting and mining academic social networks. Specifically, the system focuses on: 1) Extracting researcher profiles automatically from the Web; 2) Integrating the publication data into the network from existing digital libraries; 3) Modeling the entire academic network; and 4) Providing search services for the academic network. So far, 448,470 researcher profiles have been extracted using a unified tagging approach. We integrate publications from online Web databases and propose a probabilistic framework to deal with the name ambiguity problem. Furthermore, we propose a unified modeling approach to simultaneously model topical aspects of papers, authors, and publication venues. Search services such as expertise search and people association search have been provided based on the modeling results. In this paper, we describe the architecture and main features of the system. We also present the empirical evaluation of the proposed methods.


Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%
Overall acceptance rate: 1,133 of 8,635 submissions, 13%
