DOI: 10.1145/1807167.1807254

ONDUX: on-demand unsupervised learning for information extraction

Published: 06 June 2010

ABSTRACT

Information extraction by text segmentation (IETS) applies to cases in which data values of interest are organized in implicit semi-structured records available in textual sources (e.g., postal addresses, bibliographic information, ads). It is an important practical problem that has been frequently addressed in the recent literature. In this paper we introduce ONDUX (On Demand Unsupervised Information Extraction), a new unsupervised probabilistic approach for IETS. Like other unsupervised IETS approaches, ONDUX relies on information available in pre-existing data to associate segments in the input string with attributes of a given domain. Unlike other approaches, we rely on very effective matching strategies instead of explicit learning strategies. The effectiveness of these matching strategies is also exploited to disambiguate the extraction of certain attributes through a reinforcement step that explores the sequencing and positioning of attribute values, learned on demand directly from test data with no previous human-driven training, a feature unique to ONDUX. This gives ONDUX a high degree of flexibility and results in superior effectiveness, as demonstrated by the experimental evaluation we report with textual sources from different domains, in which ONDUX is compared with a state-of-the-art IETS approach.
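
As a rough illustration of the general idea of matching input segments against pre-existing data, the sketch below labels each token with the attribute whose known values contain it and then groups adjacent tokens that share a label. It is not the ONDUX algorithm: the knowledge base contents, the names `KNOWLEDGE_BASE`, `label_token`, and `segment`, and the exact-match rule are all assumptions made for this example, and the probabilistic matching and reinforcement step described in the abstract are omitted.

```python
from itertools import groupby

# Hypothetical knowledge base: attribute -> terms seen in pre-existing records.
KNOWLEDGE_BASE = {
    "street": {"main", "st", "oak", "avenue"},
    "city": {"springfield", "portland"},
    "state": {"il", "or"},
}


def label_token(token, kb):
    """Return the first attribute whose known terms contain the token, else None."""
    normalized = token.lower().strip(",.")
    for attribute, terms in kb.items():
        if normalized in terms:
            return attribute
    return None


def segment(text, kb):
    """Group consecutive tokens that share a label into (attribute, segment) pairs."""
    labeled = [(label_token(tok, kb), tok) for tok in text.split()]
    return [
        (attribute, " ".join(tok for _, tok in group))
        for attribute, group in groupby(labeled, key=lambda pair: pair[0])
    ]


# Tokens with no match in the knowledge base keep the label None.
result = segment("12 Main St Springfield IL", KNOWLEDGE_BASE)
```

In this toy version a token such as a house number stays unlabeled because it never appears in the knowledge base; resolving such cases from the order and position of neighboring attributes is the kind of disambiguation the abstract attributes to the reinforcement step.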

Published in

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data
June 2010
1286 pages
ISBN: 9781450300322
DOI: 10.1145/1807167

Copyright © 2010 ACM

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
