Accurate fact harvesting from natural language text in Wikipedia with Lector

Authors:
Matteo Cannaviccio, Roma Tre University, Rome, Italy
Denilson Barbosa, University of Alberta, Edmonton, Canada
Paolo Merialdo, Roma Tre University, Rome, Italy

WebDB '16: Proceedings of the 19th International Workshop on Web and Databases, June 2016, Article No.: 9, Pages 1–6, https://doi.org/10.1145/2932194.2932203

Published: 26 June 2016

ABSTRACT

Many approaches have been introduced recently to automatically create or augment Knowledge Graphs (KGs) with facts extracted from Wikipedia, particularly from its structured components such as infoboxes. Although these structures are valuable, they represent only a fraction of the information expressed in the articles. In this work, we quantify the number of facts that can be harvested with high precision from the text of Wikipedia articles using information extraction techniques bootstrapped from the entities and relations already in a KG. Our experimental evaluation, which uses Freebase as the reference KG, reveals that we can augment several relations in the domain of people by more than 10%, with facts whose accuracy is over 95%. Moreover, the vast majority of these facts are missing from the infoboxes, YAGO, and DBpedia.
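
The idea of bootstrapping extraction from facts already in a KG can be illustrated with a minimal sketch. This is not the Lector implementation, which the abstract does not detail; it is a generic distant-supervision-style illustration under stated assumptions: sentences are already reduced to hypothetical (subject, phrase, object) triples with linked entities, the KG is a plain dictionary, and the names `learn_phrases`, `harvest_new_facts`, and `min_count` are invented for this example.

```python
from collections import Counter, defaultdict


def learn_phrases(sentences, kg_facts, min_count=2):
    """Associate textual phrases with KG relations.

    sentences: iterable of (subject, phrase, object) triples (assumed to be
               pre-extracted and entity-linked; a simplification).
    kg_facts:  dict mapping (subject, object) -> relation name.
    Returns a dict phrase -> relation, keeping only phrases that co-occur
    with exactly one relation at least `min_count` times.
    """
    counts = defaultdict(Counter)
    for subj, phrase, obj in sentences:
        relation = kg_facts.get((subj, obj))
        if relation is not None:
            counts[phrase][relation] += 1

    phrase_to_relation = {}
    for phrase, rel_counts in counts.items():
        relation, n = rel_counts.most_common(1)[0]
        # Favor precision: drop phrases that are rare or ambiguous.
        if n >= min_count and len(rel_counts) == 1:
            phrase_to_relation[phrase] = relation
    return phrase_to_relation


def harvest_new_facts(sentences, kg_facts, phrase_to_relation):
    """Apply learned phrases to entity pairs that the KG does not yet cover."""
    new_facts = set()
    for subj, phrase, obj in sentences:
        relation = phrase_to_relation.get(phrase)
        if relation is not None and (subj, obj) not in kg_facts:
            new_facts.add((subj, relation, obj))
    return new_facts


# Toy data with placeholder entities (not taken from the paper).
toy_sentences = [
    ("Person A", "was born in", "City X"),
    ("Person B", "was born in", "City Y"),
    ("Person C", "was born in", "City Z"),
    ("Person A", "studied at", "School U"),
]
toy_kg = {
    ("Person A", "City X"): "place_of_birth",
    ("Person B", "City Y"): "place_of_birth",
}

toy_phrases = learn_phrases(toy_sentences, toy_kg)
toy_new_facts = harvest_new_facts(toy_sentences, toy_kg, toy_phrases)
```

In this sketch, the strict filter on rare or ambiguous phrases is one simple way to trade recall for precision; a real system would need entity linking, phrase normalization, and a more careful scoring scheme.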

Published in

WebDB '16: Proceedings of the 19th International Workshop on Web and Databases
June 2016
59 pages
ISBN: 9781450343107
DOI: 10.1145/2932194

Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- research-article
Acceptance Rates
WebDB '16 Paper Acceptance Rate: 9 of 29 submissions, 31%
Overall Acceptance Rate: 30 of 100 submissions, 30%