research-article

A framework for learning web wrappers from the crowd

Authors:
Valter Crescenzi, Università Roma Tre, Rome, Italy
Paolo Merialdo, Università Roma Tre, Rome, Italy
Disheng Qiu, Università Roma Tre, Rome, Italy
WWW '13: Proceedings of the 22nd international conference on World Wide Web, May 2013, Pages 261–272, https://doi.org/10.1145/2488388.2488412

Published: 13 May 2013

ABSTRACT

Scaling the extraction of data from Web sources remains a challenging problem. Supervised approaches can achieve high accuracy, but the cost of training data, i.e., annotations over a set of sample pages, limits their scalability. Crowdsourcing platforms are making manual annotation more affordable. However, the tasks submitted to these platforms must be extremely simple, so that non-experts can perform them, and their number must be minimized to contain costs. We introduce a framework that supports a supervised wrapper inference system with training data generated by the crowd. The training data are labeled values obtained by posing membership queries, the simplest form of query, to the crowd. We show that the cost of producing the training data is strongly affected by the expressiveness of the wrapper formalism and by the choice of the training set. Traditional supervised wrapper inference approaches use a statically defined formalism and assume that it can express the target wrapper. In contrast, we present an inference algorithm that dynamically chooses the expressiveness of the wrapper formalism and actively selects the training set, while minimizing the number of membership queries posed to the crowd. Experiments on real web sources confirm the effectiveness and feasibility of the approach.
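The interplay the abstract describes (membership queries, an actively chosen training set, and a wrapper formalism whose expressiveness grows only when needed) can be illustrated with a small version-space sketch. This is not the authors' algorithm: the page model (a dict from node path to text), the rule names, the `learn_wrapper` function, and the simulated `oracle` are all hypothetical and chosen only to keep the example self-contained.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Page = Dict[str, str]                      # assumed page model: node path -> text
Rule = Tuple[str, Callable[[Page], str]]   # (rule name, extraction function)
Oracle = Callable[[int, str], bool]        # membership query: is `value` correct on page i?


def learn_wrapper(pages: List[Page], levels: List[List[Rule]], oracle: Oracle,
                  max_queries: int = 20) -> Tuple[List[str], int]:
    """Return (names of surviving rules, number of queries asked).

    `levels` lists rule classes from least to most expressive.
    `pages` is assumed to be non-empty.
    """
    answers: List[Tuple[int, str, bool]] = []  # (page index, value, crowd answer)

    def consistent(rule: Rule) -> bool:
        extract = rule[1]
        return all((extract(pages[i]) == value) == ok for i, value, ok in answers)

    pool: List[Rule] = []
    for level in levels:
        pool.extend(level)  # widen the formalism only when the previous class failed
        alive = [r for r in pool if consistent(r)]
        while alive and len(answers) < max_queries:
            # Active selection: pick the page where surviving rules disagree most.
            votes = [Counter(extract(p) for _, extract in alive) for p in pages]
            i = max(range(len(pages)), key=lambda k: len(votes[k]))
            confirmed = any(ok for _, _, ok in answers)
            if len(votes[i]) == 1 and confirmed:
                return [name for name, _ in alive], len(answers)
            value = votes[i].most_common(1)[0][0]
            answers.append((i, value, oracle(i, value)))
            alive = [r for r in alive if consistent(r)]
    return [], len(answers)


def field(key: str) -> Rule:
    """Hypothetical rule that reads one node path from a page."""
    return (key, lambda page: page.get(key, ""))


# Toy data: the price sits in div[2] on two pages but in div[3] on the first.
pages = [
    {"div[1]": "Title C", "div[2]": "promo", "div[3]": "30", "after:Price": "30"},
    {"div[1]": "Title A", "div[2]": "10", "after:Price": "10"},
    {"div[1]": "Title B", "div[2]": "20", "after:Price": "20"},
]
truth = ["30", "10", "20"]
levels = [[field("div[1]"), field("div[2]")],   # positional rules (less expressive)
          [field("after:Price")]]               # label-anchored rule (more expressive)


def oracle(i: int, value: str) -> bool:
    """Simulated crowd worker answering yes/no from the known truth."""
    return truth[i] == value


names, used = learn_wrapper(pages, levels, oracle)
print(names, used)  # ['after:Price'] 3
```

In this toy run, two "no" answers eliminate both positional rules, which triggers the move to the more expressive class; one "yes" then confirms the label-anchored rule. The sketch stops as soon as the surviving rules agree on every sample page and one answer has been positive, so rules that happen to agree on all sampled pages are indistinguishable to it; a practical system would need a more careful stopping criterion and would have to cope with unreliable workers.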


Index Terms

1. Information systems
  1. World Wide Web
    1. Web applications
    2. Web services


Published in
WWW '13: Proceedings of the 22nd international conference on World Wide Web
May 2013
1628 pages
ISBN:9781450320351
DOI:10.1145/2488388
General Chairs: Daniel Schwabe (PUC-Rio - Brazil), Virgílio Almeida (UFMG - Brazil), Hartmut Glaser (CGI.br - Brazil)
Program Chairs: Ricardo Baeza-Yates (Yahoo! Labs - Spain & Chile), Sue Moon (KAIST - South Korea)
Copyright © 2013. Copyright is held by the International World Wide Web Conference Committee (IW3C2).
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
active learning
crowdsourcing
wrapper generation
Acceptance Rates
WWW '13 Paper Acceptance Rate: 125 of 831 submissions, 15%
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
