ABSTRACT
Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g., postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of these matching strategies is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores the sequencing and positioning of attribute values, learned on demand directly from test data with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report with textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
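As a rough illustration of the matching idea described in the abstract, the sketch below labels candidate segments by term overlap with pre-existing attribute values. It is not the matching function used by ONDUX: the knowledge base contents, the overlap score, the threshold, and every function name here are assumptions chosen only to make the idea concrete.

```python
# Hypothetical pre-existing data: attribute -> known values (illustrative only).
KNOWLEDGE_BASE = {
    "street": ["Main Street", "Oak Avenue", "Elm Street"],
    "city": ["Springfield", "Seattle", "Austin"],
    "phone": ["555-1234", "555-9876"],
}


def build_term_index(kb):
    """Collect, for each attribute, the set of lowercased terms seen in its values."""
    return {
        attribute: {term for value in values for term in value.lower().split()}
        for attribute, values in kb.items()
    }


def score(segment, attribute, index):
    """Fraction of the segment's terms that occur in known values of the attribute."""
    terms = segment.lower().split()
    if not terms:
        return 0.0
    hits = sum(1 for term in terms if term in index[attribute])
    return hits / len(terms)


def label_segment(segment, index, threshold=0.5):
    """Return the best-scoring attribute, or None if no score reaches the threshold."""
    best_attribute, best_score = None, 0.0
    for attribute in index:
        current = score(segment, attribute, index)
        if current > best_score:
            best_attribute, best_score = attribute, current
    return best_attribute if best_score >= threshold else None


if __name__ == "__main__":
    index = build_term_index(KNOWLEDGE_BASE)
    for segment in ["Pine Street", "Seattle", "unknown text"]:
        print(segment, "->", label_segment(segment, index))
```

Segments that match no attribute are left unlabeled in this sketch. In the approach described above, such cases are the kind that the reinforcement step would try to resolve using the sequencing and positioning of attribute values; that step is not modeled here.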
Index Terms
- ONDUX: on-demand unsupervised learning for information extraction