research-article

Semi-supervised multi-task learning of structured prediction models for web information extraction

Authors:

Paramveer S. Dhillon,

Sundararajan Sellamanickam,

Sathiya Keerthi Selvaraj

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

Pages 957 - 966

https://doi.org/10.1145/2063576.2063713

Published: 24 October 2011

Abstract

Extracting information from web pages is an important problem with several applications, such as improving search results and constructing databases to serve user queries. In this paper we propose a novel structured prediction method that addresses two important aspects of the extraction problem: (1) labeled data is available for only a small number of sites, and (2) a machine-learned global model does not generalize adequately across many websites. To this end, we propose a weight-space-based graph regularization method. This method has several advantages. First, it can use unlabeled data to address the shortage of labeled data, and so falls in the class of graph-regularization-based semi-supervised learning approaches. Second, to address the inadequate generalization of a global model, it builds a local model for each website. Viewing the construction of each website's local model as a task, we learn the models for a collection of sites jointly; our method can therefore also be seen as a graph-regularization-based multi-task learning approach. Learning the models jointly with the proposed method is useful in two ways: (1) the local model for a website can be influenced by labeled and unlabeled data from other websites; and (2) a reasonable local model can be learned even for a website with only unlabeled examples. We demonstrate the efficacy of our method on several real-life datasets; experimental results show that significant performance improvements can be obtained by combining semi-supervised and multi-task learning in a single framework.
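As a rough illustration of what graph regularization in weight space across sites can look like, the Python sketch below fits one linear model per site and penalizes differences between the weights of sites joined by an edge. It is not the paper's method. A squared loss on scalar targets stands in for a structured prediction loss, the site-similarity graph `edges` is supplied by hand (the abstract does not say how the graph is built), and the names `fit_local_models`, `lam`, `lr`, and `steps` are hypothetical. The assumed objective is: per-site mean squared error, plus `lam` times the sum over edges `(i, j, s)` of `s * ||w_i - w_j||^2`.

```python
def fit_local_models(sites, edges, dim, lam=1.0, lr=0.05, steps=2000):
    """Jointly fit one weight vector per site by plain gradient descent.

    sites: list of (X, y); X is a list of feature lists, y a list of floats.
           A site with no labeled examples is given as ([], []).
    edges: list of (i, j, s) with s >= 0 the assumed similarity of sites i, j.
    """
    W = [[0.0] * dim for _ in sites]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in sites]
        # Data-fit term: gradient of the mean squared error on labeled examples.
        for t, (X, y) in enumerate(sites):
            n = len(X)
            for x, target in zip(X, y):
                err = sum(w * xi for w, xi in zip(W[t], x)) - target
                for k in range(dim):
                    grads[t][k] += 2.0 * err * x[k] / n
        # Graph term: gradient of lam * s * ||w_i - w_j||^2 for each edge.
        for i, j, s in edges:
            for k in range(dim):
                diff = W[i][k] - W[j][k]
                grads[i][k] += 2.0 * lam * s * diff
                grads[j][k] -= 2.0 * lam * s * diff
        # Simultaneous update of every site's weights.
        for t in range(len(sites)):
            for k in range(dim):
                W[t][k] -= lr * grads[t][k]
    return W


# Toy data: two labeled sites with slopes 2 and 3, and one unlabeled site
# linked to both with equal similarity.
sites = [
    ([[1.0], [2.0]], [2.0, 4.0]),
    ([[1.0], [2.0]], [3.0, 6.0]),
    ([], []),
]
edges = [(0, 2, 1.0), (1, 2, 1.0)]
W = fit_local_models(sites, edges, dim=1)
print(W)
```

In this toy setup the third site has no labeled examples, so its weight is determined only by the graph term and settles midway between the weights of its two neighbors. The two labeled sites are in turn pulled slightly toward each other. This is one simple way to see how a site with only unlabeled data could still receive a usable local model.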

Index Terms

Semi-supervised multi-task learning of structured prediction models for web information extraction
1. Information systems
  1. Information retrieval
  2. Information storage systems

Published In

CIKM '11: Proceedings of the 20th ACM international conference on Information and knowledge management

October 2011

2712 pages

ISBN: 9781450307178

DOI: 10.1145/2063576

Editors:
Bettina Berendt,
Arjen de Vries,
Wenfei Fan,
Craig Macdonald (University of Glasgow, UK),
Iadh Ounis (University of Glasgow, UK),
Ian Ruthven (University of Strathclyde, UK)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

CIKM '11

CIKM '11: International Conference on Information and Knowledge Management

October 24 - 28, 2011

Glasgow, Scotland, UK

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

K. Boubouh, A. Boussetta, Y. Benkaouz, and R. Guerraoui. Robust P2P Personalized Learning. In 2020 International Symposium on Reliable Distributed Systems (SRDS), pages 299--308, 2020. Online publication date: Sep-2020.
https://doi.org/10.1109/SRDS51746.2020.00037

O. Yuliana and C. Chang. DCADE: divide and conquer alignment with dynamic encoding for full page data extraction. Applied Intelligence, 2019. Online publication date: 22-Jul-2019.
https://doi.org/10.1007/s10489-019-01499-0

O. Yuliana and C. Chang. A novel alignment algorithm for effective web data extraction from singleton-item pages. Applied Intelligence, 48(11):4355--4370, 2018. Online publication date: 1-Nov-2018.
https://dl.acm.org/doi/10.1007/s10489-018-1208-0

H. Cardona, M. Álvarez, and Á. Orozco. Convolved Multi-output Gaussian Processes for Semi-Supervised Learning. In Image Analysis and Processing — ICIAP 2015, pages 109--118, 2015. Online publication date: 21-Aug-2015.
https://doi.org/10.1007/978-3-319-23231-7_10

S. Xu, X. An, X. Qiao, and L. Zhu. Multi-task least-squares support vector machines. Multimedia Tools and Applications, 71(2):699--715, 2014. Online publication date: 1-Jul-2014.
https://dl.acm.org/doi/10.1007/s11042-013-1526-5
