tutorial

Exploiting Wikipedia for Information Retrieval Tasks

Authors:
Bracha Shapira

Ben-Gurion University of the Negev, Beer-Sheva, Israel

Nir Ofek

Ben-Gurion University of the Negev, Beer-Sheva, Israel

Victor Makarenkov

Ben-Gurion University of the Negev, Beer-Sheva, Israel

SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, August 2015, Pages 1137–1140
https://doi.org/10.1145/2766462.2767879

Published: 9 August 2015

ABSTRACT

Wikipedia, the online encyclopedia, has long been used by researchers as a source of information and has also been a subject of research itself. It has been shown to be effective in recommender systems, sentiment analysis, validation, and multiple domains in information retrieval. One reason for Wikipedia's popularity among researchers and practitioners is the multiple types of information it contains, which enable practitioners to select the right "tool" for their respective tasks. Along with its great potential, this multitude of information sources also poses a challenge: which sources of information are best suited to a specific problem, and how can different types of data be combined? This tutorial aims to provide a holistic view of Wikipedia's different features (text, links, categories, page views, editing history, etc.) and to explore the different ways they can be utilized in a machine learning framework. By presenting and contrasting the latest works that utilize Wikipedia in multiple domains, this tutorial aims to increase awareness among researchers and practitioners in these fields of the benefits of utilizing Wikipedia in their respective domains, and in particular of using multiple sources of information simultaneously.

Index Terms

Exploiting Wikipedia for Information Retrieval Tasks
1. Information systems
  1. Information retrieval

Published in
SIGIR '15: Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval
August 2015
1198 pages
ISBN: 9781450336215
DOI: 10.1145/2766462
General Chair: Ricardo Baeza-Yates (Yahoo Labs, USA)
Program Chairs: Mounia Lalmas (Yahoo Labs, UK); Alistair Moffat (University of Melbourne, Australia); Berthier Ribeiro-Neto (Google, Brazil, and UFMG, Brazil)
Copyright © 2015 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
information retrieval
machine learning
wikipedia
Qualifiers
- tutorial
Acceptance Rates
SIGIR '15 Paper Acceptance Rate: 70 of 351 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%
Article Metrics
- Total Citations: 5
- Total Downloads: 431
- Downloads (Last 12 months): 10
- Downloads (Last 6 weeks): 1