DOI: 10.1145/2594538.2594563
tutorial

Database principles in information extraction

Published: 18 June 2014

ABSTRACT

Information Extraction commonly refers to the task of populating a relational schema, having predefined underlying semantics, from textual content. This task is pervasive in contemporary computational challenges associated with Big Data. This tutorial gives an overview of the algorithmic concepts and techniques used for performing Information Extraction tasks, and describes some of the declarative frameworks that provide abstractions and infrastructure for programming extractors. In addition, the tutorial highlights opportunities for research impact through principles of data management, illustrates these opportunities through recent work, and proposes directions for future research.


Published in

PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems
June 2014
300 pages
ISBN: 9781450323758
DOI: 10.1145/2594538
General Chair: Richard Hull
Program Chair: Martin Grohe

Copyright © 2014 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

PODS '14 paper acceptance rate: 22 of 67 submissions, 33%
Overall acceptance rate: 642 of 2,707 submissions, 24%
