research-article

ONDUX: on-demand unsupervised learning for information extraction

Authors:
Eli Cortez, Universidade Federal do Amazonas, Manaus, Brazil
Altigran S. da Silva, Universidade Federal do Amazonas, Manaus, Brazil
Marcos André Gonçalves, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
Edleno S. de Moura, Universidade Federal do Amazonas, Manaus, Brazil

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, June 2010, Pages 807–818
https://doi.org/10.1145/1807167.1807254

Published: 6 June 2010

ABSTRACT

Information extraction by text segmentation (IETS) applies to cases in which the data values of interest are organized in implicit semi-structured records available in textual sources (e.g., postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments of the input string with attributes of a given domain. Unlike other approaches, it relies on very effective matching strategies instead of explicit learning strategies. The effectiveness of these matching strategies is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores the sequencing and positioning of attribute values, learned on demand directly from test data with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report on textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
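
To make the idea of associating input segments with attributes through matching against pre-existing data more concrete, the following is a minimal sketch in Python. It is not the ONDUX algorithm: the knowledge base contents, the function names (`build_term_index`, `label_token`, `segment`), and the simple term-frequency matching rule are all assumptions chosen for illustration, and the reinforcement step described in the abstract is not modeled.

```python
from collections import Counter, defaultdict

# Hypothetical knowledge base: attribute -> values taken from pre-existing data.
KNOWLEDGE_BASE = {
    "street": ["Main Street", "Oak Avenue"],
    "city": ["Springfield", "Manaus"],
    "number": ["12", "450"],
}


def build_term_index(kb):
    """Count how often each lowercased term occurs in the values of each attribute."""
    index = defaultdict(Counter)
    for attribute, values in kb.items():
        for value in values:
            for term in value.lower().split():
                index[term][attribute] += 1
    return index


def label_token(token, index):
    """Return the attribute whose known values contain the token most often, or None."""
    counts = index.get(token.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def segment(text, index):
    """Group consecutive tokens with the same label into (label, segment) pairs."""
    segments = []
    for token in text.split():
        label = label_token(token, index)
        if segments and segments[-1][0] == label:
            # Same label as the previous token: extend the current segment.
            segments[-1] = (label, segments[-1][1] + " " + token)
        else:
            segments.append((label, token))
    return segments


if __name__ == "__main__":
    index = build_term_index(KNOWLEDGE_BASE)
    print(segment("450 Oak Avenue Manaus", index))
```

In this sketch, tokens that match nothing in the knowledge base keep the label `None`. Resolving such segments from the sequencing and positioning of the values that were matched is the kind of work the abstract attributes to the reinforcement step.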

Index Terms

1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Data management systems

Published in
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010, 1286 pages
ISBN: 9781450300322
DOI: 10.1145/1807167
General Chair: Ahmed Elmagarmid, Purdue University, USA
Program Chair: Divyakant Agrawal, University of California at Santa Barbara, USA
Copyright © 2010 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
data management
information extraction
text segmentation
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%