Research article, DOI: 10.1145/2063576.2063713

Semi-supervised multi-task learning of structured prediction models for web information extraction

Published: 24 October 2011

Abstract

Extracting information from web pages is an important problem with applications such as improving search results and constructing databases to serve user queries. In this paper we propose a novel structured prediction method that addresses two important aspects of the extraction problem: (1) labeled data is available for only a small number of sites, and (2) a single machine-learned global model does not generalize adequately across many websites. To this end, we propose a weight-space-based graph regularization method, which has several advantages. First, it can use unlabeled data to address the limited labeled data problem, and it falls in the class of graph-regularization-based semi-supervised learning approaches. Second, to address the inadequate generalization of a global model, it builds a local model for each website. Viewing the construction of a local model for each website as a task, we learn the models for a collection of sites jointly; our method can therefore also be seen as a graph-regularization-based multi-task learning approach. Learning the models jointly is useful in two ways: (1) the local model for a website can be influenced by labeled and unlabeled data from other websites, and (2) a reasonable local model can be learned even for a website with only unlabeled examples. We demonstrate the efficacy of our method on several real-life data sets; experimental results show that significant performance improvements can be obtained by combining semi-supervised and multi-task learning in a single framework.
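
The abstract does not give the objective function, so the following is only a minimal sketch of the general idea of graph regularization in weight space, not the paper's method. It makes several simplifying assumptions: each site gets a plain linear model with squared loss instead of a structured prediction model, the site graph is built from the mean feature vector of each site's pages (labeled or not), and the names `site_graph`, `fit_site_models`, `lam`, `mu`, and `bandwidth` are all hypothetical. The assumed objective is the squared loss on labeled sites, plus `mu * ||w_i||^2` per site, plus `(lam / 2) * sum_ij A[i][j] * ||w_i - w_j||^2`, minimized by gradient descent.

```python
import math
import random


def site_graph(X, bandwidth=1.0):
    """Similarity between sites from the mean feature vector of their pages.

    Uses all pages of a site, labeled or not (an assumed way of letting
    unlabeled data shape the graph). Returns a symmetric matrix A.
    """
    means = [[sum(col) / len(pages) for col in zip(*pages)] for pages in X]
    n = len(X)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                d2 = sum((a - b) ** 2 for a, b in zip(means[i], means[j]))
                A[i][j] = math.exp(-d2 / (2.0 * bandwidth ** 2))
    return A


def fit_site_models(X, y, A, lam=1.0, mu=0.01, lr=0.05, steps=2000):
    """Jointly fit one weight vector per site by gradient descent.

    X[i] is the list of feature vectors for site i; y[i] is the list of
    targets, or None if site i has no labels. A is the site graph.
    """
    n_sites, dim = len(X), len(X[0][0])
    W = [[0.0] * dim for _ in range(n_sites)]
    for _ in range(steps):
        # Gradient of the ridge term mu * ||w_i||^2.
        G = [[2.0 * mu * w for w in W[i]] for i in range(n_sites)]
        for i in range(n_sites):
            # Graph term: pulls w_i toward the weights of similar sites.
            for j in range(n_sites):
                for k in range(dim):
                    G[i][k] += 2.0 * lam * A[i][j] * (W[i][k] - W[j][k])
            # Data term: mean squared error, only where labels exist.
            if y[i] is not None:
                for x, t in zip(X[i], y[i]):
                    r = sum(w * v for w, v in zip(W[i], x)) - t
                    for k in range(dim):
                        G[i][k] += 2.0 * r * x[k] / len(y[i])
        for i in range(n_sites):
            for k in range(dim):
                W[i][k] -= lr * G[i][k]
    return W


# Toy usage: three sites share the same underlying weights; site 2 has no labels.
random.seed(0)
w_true = [1.0, -2.0]
X = [[[random.gauss(0, 1) for _ in range(2)] for _ in range(40)] for _ in range(3)]
y = [[sum(w * v for w, v in zip(w_true, x)) for x in X[i]] for i in range(2)] + [None]
W = fit_site_models(X, y, site_graph(X))
print("weights learned for the unlabeled site:", W[2])
```

In this toy setting the unlabeled site receives no gradient from the loss, so its weights are determined entirely by the graph term, which illustrates how a site with only unlabeled examples can still obtain a local model through its neighbors.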

Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management
October 2011, 2712 pages
ISBN: 9781450307178
DOI: 10.1145/2063576

Publisher

Association for Computing Machinery, New York, NY, United States

Author Tags

1. information extraction
2. multitask learning
3. semi-supervised learning
4. structured predictions

Cited By

• (2020) Robust P2P Personalized Learning. 2020 International Symposium on Reliable Distributed Systems (SRDS), pages 299--308. DOI: 10.1109/SRDS51746.2020.00037. Online publication date: Sep-2020.
• (2019) DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence. DOI: 10.1007/s10489-019-01499-0. Online publication date: 22-Jul-2019.
• (2018) A novel alignment algorithm for effective web data extraction from singleton-item pages. Applied Intelligence, 48(11):4355--4370. DOI: 10.1007/s10489-018-1208-0. Online publication date: 1-Nov-2018.
• (2015) Convolved Multi-output Gaussian Processes for Semi-Supervised Learning. Image Analysis and Processing - ICIAP 2015, pages 109--118. DOI: 10.1007/978-3-319-23231-7_10. Online publication date: 21-Aug-2015.
• (2014) Multi-task least-squares support vector machines. Multimedia Tools and Applications, 71(2):699--715. DOI: 10.1007/s11042-013-1526-5. Online publication date: 1-Jul-2014.
