ABSTRACT
Complex information extraction (IE) pipelines are becoming an integral component of most text processing frameworks. We introduce a first system that helps IE users interactively analyze extraction pipeline semantics and operator transformations while debugging. Interactive analysis keeps the debugging effort proportional to the need and lets users focus on the portions of the pipeline under the greatest suspicion. We present a generic debugger for post-execution analysis of any IE pipeline consisting of arbitrary types of operators. To this end, we propose an effective provenance model for IE pipelines that captures a variety of operator types, ranging from operators with full specifications to those with no specifications at all. We have evaluated our proposed algorithms and provenance model on large-scale real-world extraction pipelines.
Index Terms
- Building a generic debugger for information extraction pipelines