ABSTRACT
Knowledge bases capture millions of entities such as people, companies, or movies. However, their coverage of named events such as sports finals, political scandals, or natural disasters is fairly limited, because such events emerge continuously. This paper presents a method for extracting named events from news articles, reconciling them into canonicalized representations, and organizing them into fine-grained semantic classes to populate a knowledge base. Our method captures similarity measures among news articles in a multi-view attributed graph, considering textual contents, entity occurrences, and temporal ordering. To distill canonicalized events from this raw data, we present a novel graph coarsening algorithm based on the information-theoretic principle of minimum description length. The quality of our method is demonstrated experimentally by extracting, organizing, and evaluating 25,000 events from a corpus of 300,000 heterogeneous news articles.
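The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Article` record, the Jaccard similarity, the `max_gap_days` time window, and the `threshold` are all assumptions made for illustration, and a plain union-find grouping stands in for the MDL-based graph coarsening, which is not reproduced here. The toy articles are invented.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class Article:
    """Hypothetical article record; field names are illustrative only."""
    id: str
    words: frozenset
    entities: frozenset
    published: date


def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_graph(articles, max_gap_days=3):
    """Connect articles published close in time; each edge stores one score per view."""
    edges = {}
    for a, b in combinations(articles, 2):
        # Assumption: temporal information is used only as a window filter here.
        if abs((a.published - b.published).days) > max_gap_days:
            continue
        edges[(a.id, b.id)] = {
            "text": jaccard(a.words, b.words),
            "entity": jaccard(a.entities, b.entities),
        }
    return edges


def group_articles(articles, edges, threshold=0.3):
    """Merge articles whose mean view similarity reaches the threshold.

    This is a simple union-find stand-in, not the MDL-based coarsening.
    """
    parent = {a.id: a.id for a in articles}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (u, v), views in edges.items():
        if (views["text"] + views["entity"]) / 2 >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for a in articles:
        groups.setdefault(find(a.id), []).append(a.id)
    return sorted(sorted(g) for g in groups.values())


# Toy data, invented for illustration.
articles = [
    Article("a1", frozenset({"final", "goal", "cup"}),
            frozenset({"Team A", "Team B"}), date(2013, 5, 25)),
    Article("a2", frozenset({"final", "cup", "win"}),
            frozenset({"Team A", "Team B"}), date(2013, 5, 26)),
    Article("a3", frozenset({"storm", "flood"}),
            frozenset({"City C"}), date(2013, 5, 26)),
]
edges = build_graph(articles)
groups = group_articles(articles, edges)
print(groups)
```

In this sketch the two articles about the same final end up in one group while the unrelated article stays alone. Keeping the per-view scores on each edge, rather than collapsing them early, leaves room to replace the averaging rule with a more principled merge criterion.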
Index Terms
- A Fresh Look on Knowledge Bases: Distilling Named Events from News