tutorial

Database principles in information extraction

Author:
Benny Kimelfeld

LogicBlox, Inc., Berkeley, CA, USA


PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, June 2014, Pages 156–163
https://doi.org/10.1145/2594538.2594563

Published: 18 June 2014

ABSTRACT

Information Extraction commonly refers to the task of populating a relational schema, having predefined underlying semantics, from textual content. This task is pervasive in contemporary computational challenges associated with Big Data. This tutorial gives an overview of the algorithmic concepts and techniques used for performing Information Extraction tasks, and describes some of the declarative frameworks that provide abstractions and infrastructure for programming extractors. In addition, the tutorial highlights opportunities for research impact through principles of data management, illustrates these opportunities through recent work, and proposes directions for future research.
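
As a rough illustration of the task described in the abstract, the sketch below populates a tiny relation from text with a regular expression. The relation name `Employment`, the sample sentences, and the pattern are hypothetical and are not taken from the tutorial; a practical extractor would need far more robust rules or a learned model.

```python
import re
from typing import List, NamedTuple


class Employment(NamedTuple):
    """One tuple of a hypothetical relation Employment(person, company)."""
    person: str
    company: str


# Hypothetical rule: "<Person> works at <Company>", where a person is one or
# more capitalized words and a company is a single capitalized word.
PATTERN = re.compile(
    r"(?P<person>[A-Z][a-z]+(?: [A-Z][a-z]+)*) works at (?P<company>[A-Z][A-Za-z]+)"
)


def extract(text: str) -> List[Employment]:
    """Return one Employment tuple per non-overlapping match in the text."""
    return [Employment(m["person"], m["company"]) for m in PATTERN.finditer(text)]


text = "Alice Smith works at Acme. Later, Bob works at Initech."
rows = extract(text)
for row in rows:
    print(row.person, "|", row.company)
```

The point of the sketch is only the shape of the output: unstructured text goes in, and tuples conforming to a predefined schema come out.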

Published in
PODS '14: Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2014
300 pages
ISBN: 9781450323758
DOI: 10.1145/2594538
General Chair: Richard Hull, IBM T.J. Watson Research Center, USA
Program Chair: Martin Grohe, RWTH Aachen University, Germany
Copyright © 2014 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
database inconsistency
database repairs
document spanners
finite-state transducers
information extraction
prioritized repairs
regular expressions
Qualifiers
- tutorial
Acceptance Rates
PODS '14 paper acceptance rate: 22 of 67 submissions, 33%
Overall acceptance rate: 642 of 2,707 submissions, 24%