ABSTRACT
Information Extraction commonly refers to the task of populating a relational schema, having predefined underlying semantics, from textual content. This task is pervasive in contemporary computational challenges associated with Big Data. This tutorial gives an overview of the algorithmic concepts and techniques used for performing Information Extraction tasks, and describes some of the declarative frameworks that provide abstractions and infrastructure for programming extractors. In addition, the tutorial highlights opportunities for research impact through principles of data management, illustrates these opportunities through recent work, and proposes directions for future research.
Index Terms
- Database principles in information extraction