ABSTRACT
Wikipedia, the online encyclopedia, has long been used as a source of information for researchers, as well as being a subject of research itself. Wikipedia has been shown to be effective in recommender systems, sentiment analysis, validation, and multiple domains in information retrieval. One reason for Wikipedia's popularity among researchers and practitioners is the multiple types of information it contains, which enable practitioners to select the right "tool" for their respective tasks. Alongside its great potential, this multitude of information sources also poses a challenge: which sources of information are best suited to a specific problem, and how can different types of data be combined? This tutorial aims to provide a holistic view of Wikipedia's different features (text, links, categories, page views, editing history, etc.) and to explore the different ways they can be utilized in a machine learning framework. By presenting and contrasting the latest works that utilize Wikipedia in multiple domains, this tutorial aims to raise awareness among researchers and practitioners in these fields of the benefits of utilizing Wikipedia in their respective domains, and in particular of using multiple sources of information simultaneously.
Index Terms
- Exploiting Wikipedia for Information Retrieval Tasks